
# Building a Production RAG Pipeline That Actually Works: Lessons from DocExtract

## The Architecture (and Why It's 3 Services, Not 1)

DocExtract is split into three services: an API, a worker, and a frontend.

User uploads PDF → API validates and enqueues job (ARQ/Redis) → Worker picks up job asynchronously → chunk + embed → pgvector store → BM25 index built in memory at retrieval time → API streams SSE progress to frontend → User queries with natural language → hybrid retrieval → Claude generates answer with citations

Why not one FastAPI service? Because document processing is slow (2-8 seconds per page), and you don't want your API workers blocked. The ARQ queue decouples upload from processing, which lets you scale workers independently and gives you a natural retry boundary.

The async split also means you can add real-time progress streaming (SSE) to the frontend without any threading complexity: the worker updates job state in Redis, the API polls it, and the frontend gets a 12-step progress bar that actually reflects what's happening.

The full system has 1,060 tests.
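The post doesn't show DocExtract's job-state schema, so the step names and payload shape below are illustrative, not the project's actual code. Still, the SSE side of the pattern is small enough to sketch: each progress update is serialized as one `text/event-stream` frame (field lines followed by a blank line), which a browser `EventSource` can consume directly.

```python
import json

PIPELINE_STEPS = 12  # the post mentions a 12-step progress bar

def format_sse_event(step: int, label: str) -> str:
    """Serialize one progress update as a text/event-stream frame.

    An SSE frame is one or more "field: value" lines followed by a
    blank line; the browser's EventSource API dispatches the `event`
    name and hands the `data` payload to the listener.
    """
    payload = {"step": step, "total": PIPELINE_STEPS, "label": label}
    return f"event: progress\ndata: {json.dumps(payload)}\n\n"

if __name__ == "__main__":
    # In the real pipeline the worker would write this state to Redis and
    # the API would poll it and stream frames like this one to the client.
    print(format_sse_event(3, "chunking"), end="")
```

With FastAPI, frames like these would typically be yielded from a `StreamingResponse` with `media_type="text/event-stream"`; the double newline is what marks the end of each event.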
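The post says "hybrid retrieval" over BM25 and pgvector results but doesn't name the fusion method, so this sketch assumes reciprocal rank fusion (RRF), a common rank-based way to merge a lexical and a semantic ranking; the document IDs and `k` value are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one combined ranking.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, so documents ranked well by both retrievers rise to
    the top. k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical hits: one list from the in-memory BM25 index, one from pgvector.
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d7", "d9"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # d1 and d7 win: both lists agree on them
```

Rank-based fusion like this sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales, which is why it's a popular default for hybrid setups.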



