
Extract Clean Text from Any Webpage for RAG Pipelines
Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML. Here's a simple approach using CheerioCrawler:

```javascript
// Remove noise
$("script, style, nav, footer, header, aside, .ad, noscript").remove();

// Get main content
let text = $("article, [role=main], main, .content").first().text();
if (!text || text.length < 100) text = $("body").text();

// Clean whitespace
text = text.replace(/\s+/g, " ").trim();
```

Why Not Just Use body.text()?

Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want ONLY the main content.

The Priority Order

1. <article> tag: most semantic, usually contains the main content
2. [role="main"]: ARIA landmark
3. <main>: HTML5 semantic element
4. .content, .post-content: common CSS classes
5. <body>: fallback

Output

```json
{
  "url": "https://example.com/blog/post",
  "title": "The Blog Post Title",
  "text": "Clean extracted text...",
  "wordCount": 1450,
  "characterCount": 870
```
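One detail worth sketching: as far as I know, a grouped selector such as "article, [role=main], main, .content" returns matches in document order, so .first() picks whichever element appears earliest on the page rather than the highest-priority one. The rough sketch below tries each selector in the listed priority order instead, then builds a record with the same fields as the output above. It assumes `$` is a Cheerio-style function that has already had the noise elements removed. The helper names (`cleanWhitespace`, `extractMainText`, `buildRecord`) and the constants are hypothetical, not part of Cheerio or CheerioCrawler, and unlike the original snippet the 100-character check here runs after whitespace cleanup.

```javascript
// Sketch only: `$` is assumed to be a Cheerio-style function, e.g. the `$`
// available in a CheerioCrawler request handler. All names here are hypothetical.

// Selectors in the priority order listed above, most specific first.
const CONTENT_SELECTORS = ['article', '[role=main]', 'main', '.content', '.post-content'];

// Same threshold the original snippet uses before falling back to <body>.
const MIN_LENGTH = 100;

// Collapse runs of whitespace into single spaces and trim both ends.
function cleanWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Try each selector in turn so that priority, not document order, decides.
function extractMainText($) {
  for (const selector of CONTENT_SELECTORS) {
    const match = $(selector).first();
    if (match.length === 0) continue; // nothing matched this selector
    const text = cleanWhitespace(match.text());
    if (text.length >= MIN_LENGTH) return text;
  }
  // Fallback: the whole body, as in the original snippet.
  return cleanWhitespace($('body').text());
}

// Build a record with the fields shown in the output above.
// Expects `text` to be whitespace-cleaned already.
function buildRecord(url, title, text) {
  const words = text.length === 0 ? [] : text.split(' ');
  return { url, title, text, wordCount: words.length, characterCount: text.length };
}
```

Inside a request handler this might be called as `buildRecord(request.url, $('title').text(), extractMainText($))`, assuming the noise-removal step has already run on `$`.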
Continue reading on Dev.to Webdev



