
My RAG Pipeline Took an Hour. Here's How I Got It Down to 30 Seconds.
A content ingestion job used to take over an hour. Now it finishes in 30 seconds. No change in hardware: just better utilization of what was already there, a smarter queue system, and hours spent debugging how CUDA and multiprocessing work. Here's how I got there.

I was building a RAG application with Django, using Milvus as my vector database. My initial document ingestion flow was very simple: create a Celery task → fetch the page → chunk the page → create vector embeddings → upload them to Milvus. This worked great. There was nothing wrong with it, other than the fact that it was slow: ingesting the entire Django docs took over an hour. Can we do better?

I run everything on my personal computer. I have a CUDA GPU (an Nvidia 4070 Super), so I wanted to see if it could speed up the process. I changed the embedding model to use the GPU, tweaked some of the Docker images, and got the GPU creating embeddings in my test code:

def get_embedding_model(force_cpu: bool = False):
    global embeddin
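The snippet above is cut off, so here is a hedged reconstruction of what such a function typically looks like: a lazily initialized, process-wide model with a CPU fallback. The loader and CUDA check are stubs standing in for real calls (e.g. a sentence-transformers model and `torch.cuda.is_available()`); the names are illustrative, not the author's exact code.

```python
# Sketch of a lazy, process-wide embedding model with device selection.
# _load_model and _cuda_available are stubs; in the real pipeline they would
# wrap e.g. SentenceTransformer(..., device=device) and torch.cuda.is_available().
embedding_model = None


def _load_model(device: str):
    # Stand-in for loading a real embedding model onto the chosen device.
    return {"device": device}


def _cuda_available() -> bool:
    # Stand-in for torch.cuda.is_available(); this sketch assumes no GPU.
    return False


def get_embedding_model(force_cpu: bool = False):
    """Create the embedding model once per process, on the GPU when possible."""
    global embedding_model
    if embedding_model is None:
        device = "cpu" if force_cpu or not _cuda_available() else "cuda"
        embedding_model = _load_model(device)
    return embedding_model
```

The `global` cache matters later: with multiprocessing, each worker process gets its own copy of this global, which is exactly the kind of CUDA-and-multiprocessing interaction worth debugging.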
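For context, the "chunk the page" step in the pipeline above can be sketched as a simple overlapping word-window splitter. The chunk size and overlap here are illustrative defaults, not the values used in the original pipeline.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks for embedding.

    Overlap preserves context across chunk boundaries so a sentence cut
    in half still appears whole in at least one chunk.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is what gets handed to the embedding model and then uploaded to Milvus alongside its vector.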




