
The Curse of Context Window
TL;DR: Large-document extraction with LLMs fails less from "bad reasoning" and more from hard output limits. JSON structured outputs waste tokens on repeated keys and still truncate on big PDFs. Switching to CSV reduces overhead but doesn't fix truncation: the output can still be cut off silently. The reliable fix is to chunk the document into page batches, process the chunks asynchronously under strict concurrency limits (semaphores), and stitch the results back in order; run summarization as a separate pass.

I was working on a problem: extracting structured information from large documents on very lean infrastructure. The input documents were PDF, CSV, or Excel. For CSV and Excel, extraction was fairly straightforward, but PDFs posed a challenge of their own. The PDFs we ingested were mostly a mix of digital and scanned. The moment scanned PDFs come into the picture, one can imagine the edge cases that arrive with them: image quality, noise, orientation, spillovers, etc.
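The chunk-and-stitch approach from the TL;DR can be sketched in a few lines of asyncio. This is a minimal illustration, not the exact pipeline from the project: `model_call` stands in for whatever coroutine invokes your LLM on a batch of pages, and the batch size and concurrency cap are placeholder values. The key properties are that a semaphore bounds in-flight requests, while `asyncio.gather` preserves submission order so results stitch back in page order.

```python
import asyncio

async def extract_chunk(pages, model_call, sem):
    # The semaphore caps how many model calls run at once,
    # so a large document doesn't flood the API.
    async with sem:
        return await model_call(pages)

async def extract_document(pages, model_call, batch_size=5, max_concurrency=3):
    sem = asyncio.Semaphore(max_concurrency)
    # Split the page list into fixed-size batches.
    batches = [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
    # gather returns results in the same order the batches were submitted,
    # even though they complete out of order, so stitching is just flattening.
    results = await asyncio.gather(
        *(extract_chunk(batch, model_call, sem) for batch in batches)
    )
    return [row for chunk in results for row in chunk]
```

Because each chunk's output is small, no single response hits the model's output token limit, and a failed chunk can be retried in isolation instead of re-running the whole document.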
