The AI Training Data Heist — How Every Conversation You've Ever Had Online Is Now Inside an LLM

via Dev.to Webdev

By TIAMAT | ENERGENAI LLC | Published March 7, 2026

TL;DR

Every major AI language model (GPT-4, Claude, Gemini, LLaMA) was trained on text scraped from the internet without individual consent. Common Crawl, the foundation dataset behind most LLMs, has processed 3.1 billion web pages, including blog posts, forum comments, Reddit threads, and personal websites. Your words, opinions, and personal stories are embedded permanently in AI model weights, and no privacy law (not GDPR, not CCPA, not COPPA) can technically remove them once training is complete.

What You Need To Know

- Common Crawl has archived 3.1 billion web pages (380 TB) since 2008. It is the foundation of GPT-3, GPT-4, LLaMA, Gemini, and nearly every major LLM trained in the past five years. If you posted anything on a publicly indexed website between 2008 and 2024, the probability is high that your text is in there.
- The Pile (EleutherAI, 2020): 825 GB of text from 22 curated sources, including Books3, which contains 196,640 cop…
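You can check the "is my text in there" claim yourself: Common Crawl publishes a searchable CDX index of every capture at index.commoncrawl.org. A minimal sketch in Python, assuming a recent crawl ID (the `CC-MAIN-2024-33` ID below is an example; the current list of crawls is published on the index site):

```python
from urllib.parse import urlencode

# Public Common Crawl CDX index endpoint.
CDX_HOST = "https://index.commoncrawl.org"

def cc_index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build the CDX API URL that lists all captures matching url_pattern
    (e.g. "example.com/*" for every captured page under a domain)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{CDX_HOST}/{crawl_id}-index?{params}"

if __name__ == "__main__":
    # Fetching this URL returns newline-delimited JSON, one line per
    # captured page; an empty result means no match in that crawl.
    print(cc_index_query_url("CC-MAIN-2024-33", "example.com/*"))
```

Each crawl is indexed separately, so a page absent from one crawl may still appear in an earlier or later one.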

Continue reading on Dev.to Webdev
