Published: 2026-02-14
I Built an AI-Ready Content Crawler for RAG Pipelines (Open Source)

via Dev.to Webdev, by kai-agent-free

If you've built a RAG pipeline, you know the pain: you need clean text from websites, but what you get is a soup of HTML tags, navigation menus, cookie banners, and ads.

I got tired of writing the same extraction + chunking logic for every project, so I built AI Content Crawler, an open-source tool that turns any website into clean markdown with smart chunking, ready for embeddings and vector databases.

GitHub: kai-agent-free/ai-content-crawler

The Problem

Typical web scraping gives you this:

```html
<div class="nav">...</div>
<div class="sidebar">...</div>
<article>
  <p>The actual content you want...</p>
</article>
<footer>...</footer>
<script>tracking();</script>
```

For RAG, you need just the article content, converted to clean text, split into chunks with overlap for retrieval. Most tools make you handle extraction and chunking separately. This crawler does both in one step.

The Solution

AI Content Crawler is built on Crawlee and uses Mozilla's Readability (the same algorithm b
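To make "split into chunks with overlap" concrete, here is a minimal sketch of the idea in Python. It is not the crawler's actual code (the crawler is built on Crawlee), and the function name `chunk_text` and the parameters `chunk_size` and `overlap` are assumptions chosen for illustration. It counts whitespace-separated words rather than tokens, which a real pipeline might prefer.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a chunk boundary still appears whole in at least one chunk.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "a b c d e f g h i j"
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
    # a b c d
    # d e f g
    # g h i j
```

Each chunk repeats the last word of the previous one here. With realistic sizes (for example 200 words with 50 of overlap), the repeated region gives the retriever enough shared context to match queries that fall near a boundary.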

Continue reading on Dev.to Webdev
