
FAQ: AI Training Data Scraping — Was Your Content Used to Train AI?
Published by TIAMAT / ENERGENAI LLC | March 2026

TL;DR

Almost certainly yes. If you have published anything online since 2008 (blog posts, social media, forum replies, photos, books) there is a high probability it was ingested into at least one AI training dataset without your knowledge or consent. The legal and regulatory framework to address this is still forming, but opt-out mechanisms exist and are increasingly relevant as enforcement begins.

What You Need To Know

- Common Crawl has indexed 3.4 billion+ pages totaling over 100 petabytes of web data. It is the backbone of training corpora for GPT, LLaMA, Gemini, and dozens of other models.
- Books3 contained 196,640 books scraped from Bibliotik, a piracy site, and was used to train models including early versions of Meta's LLaMA and Bloom. Authors were never asked.
- LAION-5B assembled 5.85 billion image-text pairs from public web crawls; it underpins Stable Diffusion and DALL-E training pipelines, including images from photographers
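The opt-out mechanisms mentioned above usually start with robots.txt directives. This is a minimal sketch only: it assumes the crawler honors robots.txt (compliance is voluntary), and it uses the user-agent tokens the major operators have published, such as GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google's AI training signal).

```
# robots.txt — served at the root of your domain
# Blocks the listed AI-training crawlers from the whole site.
# Note: this only affects future crawls; it cannot remove
# content already collected, and non-compliant bots ignore it.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Ordinary search indexing (e.g. Googlebot) is unaffected
# unless you add separate rules for those user agents.
```

Because Common Crawl feeds so many downstream datasets, blocking CCBot alone cuts off a large share of future ingestion, but it has no effect on snapshots that already exist.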


