How I Built a Production RAG Pipeline with FastAPI, pgvector and Cross-Encoder Reranking

via Dev.to Python, by Martin Palopoli

I built a production RAG engine that combines hybrid search (pgvector + BM25), cross-encoder reranking, MMR diversity, semantic caching, and automatic language detection, all on top of async FastAPI and PostgreSQL. This article covers the real architecture, the technical decisions, and the key code.

Why Another RAG Article

Most RAG tutorials stop at "embed → cosine search → prompt". That works for a demo, but in production you'll run into:

- Queries in Spanish that match English chunks (or vice versa)
- The top 10 most similar chunks all coming from the same document
- Semantic search failing on proper nouns or exact codes
- Repeated answers burning tokens unnecessarily
- No way to know whether an answer is actually reliable

This article shows how I solved each of these problems in a real multi-tenant system.

The Stack

Component  | Technology
---------- | --------------------------------------------
Backend    | Python 3.12 + FastAPI (100% async)
Database   | PostgreSQL 16 + pgvector + tsvector
Embeddings | paraphrase-multilingual-MiniLM-L12-v2 (384d)
Reranker   | cross-encoder/ms-marco-M
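The excerpt cuts off before the article's actual code, so here is a minimal sketch of one common way to merge the two ranked lists that hybrid search produces (pgvector similarity hits and BM25/tsvector keyword hits): Reciprocal Rank Fusion. The function name and the `k=60` constant are illustrative assumptions, not necessarily what the article uses; the author may fuse with weighted scores instead.

```python
def rrf_fuse(vector_hits, keyword_hits, k=60):
    """Merge two best-first ranked lists of document ids with
    Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    A document that appears high in both lists accumulates the largest
    score, which is why RRF needs no score normalization between the
    cosine-similarity and BM25 scales.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Best fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF's appeal here is exactly the problem the article names: exact codes and proper nouns that embeddings miss still surface through the keyword list, while rank-based fusion sidesteps comparing incompatible score ranges.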
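For the "top 10 chunks all come from the same document" problem, the article mentions MMR diversity. A minimal sketch of Maximal Marginal Relevance follows; the `mmr` helper, the cosine implementation, and the `lambda_mult` default are my assumptions, not the article's code.

```python
import numpy as np

def mmr(query_vec, doc_vecs, lambda_mult=0.7, top_k=3):
    """Greedy Maximal Marginal Relevance selection.

    Each pick maximizes: lambda * relevance(query, doc)
                         - (1 - lambda) * max similarity to docs already picked,
    trading off relevance against redundancy. Returns selected indices.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max(
                (cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low `lambda_mult`, a near-duplicate of an already-selected chunk loses to a less similar but novel one, which is how MMR keeps the context window from being filled by one document.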
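The semantic caching that stops "repeated answers burning tokens" is also not shown in this excerpt. Here is one plausible shape for it, assuming query embeddings from the same MiniLM model: serve a cached answer when a new query embedding is within a cosine-similarity threshold of a previously answered one. The class name, threshold, and linear scan are illustrative; a production version would likely back this with pgvector itself.

```python
import numpy as np

class SemanticCache:
    """Answer cache keyed by embedding proximity rather than exact text,
    so paraphrased repeats of a question skip the LLM call."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # (embedding, answer) pairs

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def get(self, query_vec):
        """Return the cached answer for the nearest stored query,
        or None if nothing is similar enough."""
        best = max(
            self.entries,
            key=lambda e: self._cos(query_vec, e[0]),
            default=None,
        )
        if best is not None and self._cos(query_vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query_vec, answer):
        self.entries.append((np.asarray(query_vec, dtype=float), answer))
```

The threshold is the interesting knob: too low and users get stale answers to genuinely different questions, too high and only exact rephrasings hit the cache.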
