robots.txt is a sign, not a fence: 8 technical vectors through which AI still reads your website


via Dev.to Webdev, by carlosortet

You configure robots.txt like this:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /
```

You enable Cloudflare Bot Management. You set up Akamai. Maybe even a server-side paywall. And then you query ChatGPT about your product, and it cites your website as a source. How?

I work on GEO (Generative Engine Optimization) projects where we audit how LLMs represent brands. We routinely analyze thousands of prompt-response pairs. Across multiple projects, we consistently find that 10–20% of LLM responses cite the brand's own website as a source, even when every known bot is blocked. Here are the 8 technical vectors we documented, with academic sources and industry data.

1. Historical crawl data (Common Crawl)

This is the biggest one, and the least understood. Common Crawl is a nonprofit that has been archiving the web since 2007. The numbers: 9.5+ petabytes, 300+ b
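One way to see this vector for yourself is to query Common Crawl's public CDX index for captures of your own domain: if your pages were archived before you added the Disallow rules, that historical copy exists regardless of your current robots.txt. A minimal sketch using only the standard library; the crawl ID `CC-MAIN-2024-10` is just an example, and current crawl IDs are listed at index.commoncrawl.org:

```python
import json
import urllib.parse
import urllib.request

# Example crawl ID; each crawl has its own index endpoint.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def build_query(domain: str, limit: int = 5) -> str:
    """Build a CDX index query for archived URLs under a domain."""
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",   # wildcard: all paths under the domain
        "output": "json",       # one JSON record per line
        "limit": limit,
    })
    return f"{CDX_ENDPOINT}?{params}"

def archived_captures(domain: str, limit: int = 5) -> list[dict]:
    """Fetch capture records for the domain (empty list = no captures
    in this particular crawl, not proof of absence overall)."""
    with urllib.request.urlopen(build_query(domain, limit)) as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]
```

Each returned record carries fields like `url`, `timestamp`, and `status`, so you can check exactly which pages were captured and when, long before any of today's AI bots announced their user agents.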

Continue reading on Dev.to Webdev
