
# Why I debug my RAG pipeline stage by stage, not end to end
## The problem with end-to-end RAG eval

I had a working document retrieval pipeline: fixed-size chunking, TF-IDF embeddings, FAISS index. Recall@10 was 0.82 on SciFact. Good enough. Then I made one change: I swapped fixed-size chunking for sentence-based chunking. Recall dropped to 0.68.

My first instinct was to roll back. But I wanted to understand why. End-to-end eval only told me "retrieval is worse." It couldn't tell me which stage was responsible.

## The debugging approach

I restructured the pipeline so each stage can be evaluated independently. The pipeline is expressed as a string feature chain:

```python
from mloda.user import mlodaAPI, PluginCollector

# The full pipeline: each __ is a stage boundary
results = mlodaAPI.run_all(
    features=["docs__pii_redacted__chunked__deduped__embedded"],
    ...
)
```

Stop at chunking? `"docs__pii_redacted__chunked"`. Skip dedup? `"docs__pii_redacted__chunked__embedded"`. Add evaluation? `"docs__pii_redacted__chunked__deduped__embedded__evaluation"`. Each stage...
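The "stop anywhere in the chain" idea can be sketched without the library itself. Below is a minimal plain-Python illustration (the `stage_prefixes` helper is mine, not part of mloda): given the full feature chain, it generates every intermediate feature name, each of which could be evaluated on its own.

```python
def stage_prefixes(chain: str, sep: str = "__") -> list[str]:
    """Return each evaluable prefix of a __-separated feature chain."""
    parts = chain.split(sep)
    # Skip the bare source ("docs"); every later prefix is a stage boundary.
    return [sep.join(parts[: i + 1]) for i in range(1, len(parts))]

full = "docs__pii_redacted__chunked__deduped__embedded"
for feature in stage_prefixes(full):
    print(feature)
# docs__pii_redacted
# docs__pii_redacted__chunked
# docs__pii_redacted__chunked__deduped
# docs__pii_redacted__chunked__deduped__embedded
```

Each printed string is a valid pipeline in its own right, which is what makes per-stage evaluation possible.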
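For reference, the Recall@10 numbers quoted above follow the standard definition: the fraction of relevant documents that appear in the top k retrieved results. A minimal sketch (plain Python, not code from the pipeline itself):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant doc ids found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# 2 of the 3 relevant docs appear in the top k
print(recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=10))
```

Averaging this per-query score over the SciFact query set gives the single Recall@10 number that end-to-end eval reports, which is exactly why it can't localize a regression to one stage.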
Continue reading on Dev.to