
robots.txt Reveals More Than You Think — Hidden Paths, APIs, and AI Policies
Before scraping any website, check robots.txt. It tells you what the site allows and disallows for crawlers, and it often reveals hidden information about the site:

https://example.com/robots.txt

What robots.txt Reveals

- Disallowed paths = hidden content. When a site blocks /admin/, /staging/, or /api/v2/, it confirms those paths exist.
- Sitemap location. Most robots.txt files include Sitemap: https://example.com/sitemap.xml — your complete URL index.
- Crawl-delay. The minimum delay, in seconds, the site wants between bot requests. Respect this.
- Bot-specific rules. Some sites block GPTBot, Google-Extended, or CCBot specifically, revealing their AI-related policies.

Example

User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml

User-agent: GPTBot
Disallow: /

This tells you: there's an admin panel, an internal API, the site wants 2 seconds between requests, and it blocks OpenAI's crawler from all content.

Tools

Robots.txt Analyzer — parse and analyze any robots.txt
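You don't have to parse these rules by hand: Python's standard library ships urllib.robotparser. Here is a minimal sketch that parses the example file above (fed in as a string so no network call is needed; the bot name MyBot is a placeholder):

```python
import urllib.robotparser

# The example robots.txt from above, as a string instead of a fetched URL.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml

User-agent: GPTBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Allowed: /blog/ is not disallowed for the wildcard agent.
print(rp.can_fetch("MyBot", "https://example.com/blog/"))
# Blocked: /admin/ is disallowed for all agents.
print(rp.can_fetch("MyBot", "https://example.com/admin/users"))
# Blocked: GPTBot is disallowed from the whole site.
print(rp.can_fetch("GPTBot", "https://example.com/blog/"))
# The Crawl-delay for agents matching "*".
print(rp.crawl_delay("MyBot"))
# The declared sitemap URLs (Python 3.8+).
print(rp.site_maps())
```

For a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `rp.parse(...)`, then sleep `rp.crawl_delay(...)` seconds between requests.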




