How I Built a Production RAG Pipeline with FastAPI, pgvector and Cross-Encoder Reranking

via Dev.to Python, by Martin Palopoli

I built a production RAG engine that combines hybrid search (pgvector + BM25), cross-encoder reranking, MMR diversity, semantic caching, and automatic language detection, all on top of async FastAPI and PostgreSQL. This article covers the real architecture, the technical decisions, and the key code.

Why Another RAG Article

Most RAG tutorials stop at "embed → cosine search → prompt". That works for a demo, but in production you'll run into:

- Queries in Spanish that match English chunks (or vice versa)
- The top 10 most similar chunks all coming from the same document
- Semantic search failing on proper nouns or exact codes
- Repeated answers burning tokens unnecessarily
- No way to know whether an answer is actually reliable

This article shows how I solved each of these problems in a real multi-tenant system.

The Stack

Component  | Technology
---------- | --------------------------------------------
Backend    | Python 3.12 + FastAPI (100% async)
Database   | PostgreSQL 16 + pgvector + tsvector
Embeddings | paraphrase-multilingual-MiniLM-L12-v2 (384d)
Reranker   | cross-encoder/ms-marco-M
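The excerpt cuts off before the article's actual code, so here is a minimal sketch of one common way to merge the two ranked lists that hybrid search produces (pgvector similarity hits and BM25/tsvector keyword hits): Reciprocal Rank Fusion. The function name and the `k=60` constant are illustrative assumptions, not necessarily what the article uses; the author may fuse with weighted scores instead.

```python
def rrf_fuse(vector_hits, keyword_hits, k=60):
    """Merge two best-first ranked lists of document ids with
    Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    A document that appears high in both lists accumulates the largest
    score, which is why RRF needs no score normalization between the
    cosine-similarity and BM25 scales.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Best fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF's appeal here is exactly the problem the article names: exact codes and proper nouns that embeddings miss still surface through the keyword list, while rank-based fusion sidesteps comparing incompatible score ranges.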
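For the "top 10 chunks all come from the same document" problem, the article mentions MMR diversity. A minimal sketch of Maximal Marginal Relevance follows; the `mmr` helper, the cosine implementation, and the `lambda_mult` default are my assumptions, not the article's code.

```python
import numpy as np

def mmr(query_vec, doc_vecs, lambda_mult=0.7, top_k=3):
    """Greedy Maximal Marginal Relevance selection.

    Each pick maximizes: lambda * relevance(query, doc)
                         - (1 - lambda) * max similarity to docs already picked,
    trading off relevance against redundancy. Returns selected indices.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max(
                (cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low `lambda_mult`, a near-duplicate of an already-selected chunk loses to a less similar but novel one, which is how MMR keeps the context window from being filled by one document.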
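The semantic caching that stops "repeated answers burning tokens" is also not shown in this excerpt. Here is one plausible shape for it, assuming query embeddings from the same MiniLM model: serve a cached answer when a new query embedding is within a cosine-similarity threshold of a previously answered one. The class name, threshold, and linear scan are illustrative; a production version would likely back this with pgvector itself.

```python
import numpy as np

class SemanticCache:
    """Answer cache keyed by embedding proximity rather than exact text,
    so paraphrased repeats of a question skip the LLM call."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # (embedding, answer) pairs

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def get(self, query_vec):
        """Return the cached answer for the nearest stored query,
        or None if nothing is similar enough."""
        best = max(
            self.entries,
            key=lambda e: self._cos(query_vec, e[0]),
            default=None,
        )
        if best is not None and self._cos(query_vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query_vec, answer):
        self.entries.append((np.asarray(query_vec, dtype=float), answer))
```

The threshold is the interesting knob: too low and users get stale answers to genuinely different questions, too high and only exact rephrasings hit the cache.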
