
What Changed When Our Research Pipeline Hit a PDF Wall (Production Case Study)
On March 12, 2025, the PDF ingestion pipeline for a document-heavy product crossed a hard limit: nightly batches that used to finish in three hours were now spilling into the business day, causing timeouts, missed SLAs, and angry support tickets. The project was a live production feature used by legal teams to search across contracts, scanned exhibits, and technical manuals. The stakes were clear: lost user trust and a blocked roadmap that depended on faster, more reliable document understanding.

Discovery

We traced the outage to two linked problems: a brittle retrieval layer that failed on scanned PDFs with complex layouts, and an orchestration scheme that treated every file as “same weight” during processing. The existing pipeline used an off-the-shelf OCR + embedding flow that worked for plain text but degraded quickly on mixed-layout documents (tables, figures, two-column scans). The result was a high false-negative rate for entity extraction and a queue backlog. What we needed was a s
Continue reading on Dev.to


