
# Sitemap Parser That Auto-Discovers from robots.txt
Most websites have sitemaps, but finding them can be tricky. Here's a parser that auto-discovers them.

## Discovery Logic

1. Check `robots.txt` for a `Sitemap:` directive
2. Try common paths: `/sitemap.xml`, `/sitemap_index.xml`
3. Parse the XML with cheerio in `xmlMode`
4. Handle sitemap indexes recursively

## Recursive Parsing

Sitemap indexes contain links to child sitemaps:

```xml
<sitemapindex>
  <sitemap><loc>https://site.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://site.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>
```

Parse each child, then aggregate all URLs.

## Output

```json
{
  "url": "https://stripe.com/sitemap.xml",
  "lastmod": "2026-03-15",
  "changefreq": "weekly",
  "priority": 0.8
}
```

Stripe.com has 4,817 URLs across 6 child sitemaps. I built a Sitemap Parser on Apify; search *knotless_cadence sitemap*.
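The discovery and recursion steps above can be sketched roughly as follows. The actual parser uses cheerio in `xmlMode`; to keep this sketch dependency-free, it pulls `<loc>` values out with a regex instead, and the fetcher is injected so the traversal can be exercised without network access. All function names here (`sitemapsFromRobots`, `extractLocs`, `collectUrls`) are hypothetical, not the Actor's real API.

```javascript
// Step 1: pull Sitemap: directives out of a robots.txt body.
function sitemapsFromRobots(robotsTxt) {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice(line.indexOf(":") + 1).trim());
}

// Step 2: fallback paths to probe when robots.txt has no Sitemap: directive.
const FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml"];

// Extract every <loc> value from a sitemap or sitemap-index document.
// (The real parser does this with cheerio in xmlMode.)
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

// Steps 3-4: recurse into sitemap indexes, aggregating page URLs.
// `fetchXml(url)` is an injected async fetcher, e.g. one built on fetch().
async function collectUrls(sitemapUrl, fetchXml, seen = new Set()) {
  if (seen.has(sitemapUrl)) return []; // guard against index cycles
  seen.add(sitemapUrl);
  const xml = await fetchXml(sitemapUrl);
  const locs = extractLocs(xml);
  if (/<sitemapindex/i.test(xml)) {
    // Index document: each <loc> is a child sitemap, not a page URL.
    const nested = await Promise.all(
      locs.map((child) => collectUrls(child, fetchXml, seen))
    );
    return nested.flat();
  }
  return locs; // regular <urlset>: these are the page URLs
}
```

Injecting `fetchXml` rather than calling `fetch` directly also makes it easy to add retries, caching, or a politeness delay in one place.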



