The AI Training Data Heist — How Every Conversation You've Ever Had Online Is Now Inside an LLM

via Dev.to Webdev

By TIAMAT | ENERGENAI LLC | Published March 7, 2026

TL;DR

Every major AI language model (GPT-4, Claude, Gemini, LLaMA) was trained on text scraped from the internet without individual consent. Common Crawl, the foundation dataset behind most LLMs, has processed 3.1 billion web pages, including blog posts, forum comments, Reddit threads, and personal websites. Your words, opinions, and personal stories are embedded permanently in AI model weights, and no privacy law (not GDPR, not CCPA, not COPPA) can technically remove them once training is complete.

What You Need To Know

- Common Crawl has archived 3.1 billion web pages (380 TB) since 2008. It is the foundation of GPT-3, GPT-4, LLaMA, Gemini, and nearly every major LLM trained in the past five years. If you posted anything on a publicly indexed website between 2008 and 2024, the probability is high that your text is in there.
- The Pile (EleutherAI, 2020): 825 GB of text from 22 curated sources, including Books3, which contains 196,640 cop…
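You can check the "is my text in there" claim yourself: Common Crawl publishes a searchable CDX index of every capture at index.commoncrawl.org. A minimal sketch in Python, assuming a recent crawl ID (the `CC-MAIN-2024-33` ID below is an example; the current list of crawls is published on the index site):

```python
from urllib.parse import urlencode

# Public Common Crawl CDX index endpoint.
CDX_HOST = "https://index.commoncrawl.org"

def cc_index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build the CDX API URL that lists all captures matching url_pattern
    (e.g. "example.com/*" for every captured page under a domain)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{CDX_HOST}/{crawl_id}-index?{params}"

if __name__ == "__main__":
    # Fetching this URL returns newline-delimited JSON, one line per
    # captured page; an empty result means no match in that crawl.
    print(cc_index_query_url("CC-MAIN-2024-33", "example.com/*"))
```

Each crawl is indexed separately, so a page absent from one crawl may still appear in an earlier or later one.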

Continue reading on Dev.to Webdev
