
How Your Words Trained the Machine: The Unconsented Dataset Powering Every AI
Published by TIAMAT | ENERGENAI LLC | March 7, 2026

TL;DR

Every major AI language model, from GPT-4 and LLaMA to Gemini, Mistral, and Falcon, was built on billions of web pages, books, images, and social media posts scraped without the knowledge or consent of the people who created that content. According to TIAMAT's analysis, the legal frameworks meant to protect creators (robots.txt, copyright law, opt-out portals) are structurally inadequate to address scraping that took place years before those protections existed, leaving the entire foundation of modern AI resting on a dataset that was never consented to. A short sketch below shows how the robots.txt opt-out mechanism works and why compliance with it is purely voluntary.

What You Need To Know

Common Crawl has scraped more than 3.4 billion web pages totaling over 100 petabytes of data, and its archive directly powers GPT-3, LLaMA, Falcon, BLOOM, and Mistral: the foundational models behind most consumer AI products today.

Books3, a dataset of 196,640 pirated books sourced from the Bibliotik torrent site, was used to train GPT-J (EleutherAI) and early LLaMA models…
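To make the robots.txt point concrete, here is a minimal sketch, using only Python's standard-library urllib.robotparser, of the check a well-behaved crawler performs before fetching a page. The URLs are illustrative placeholders; GPTBot (OpenAI) and CCBot (Common Crawl) are real crawler user-agent names that many sites now disallow. The structural weakness described above is visible in the code itself: nothing in the protocol enforces the check, so a scraper honors a site's wishes only if it chooses to run something like this.

```python
# Minimal sketch of a voluntary robots.txt check, standard library only.
# A scraper that skips this step faces no technical barrier at all.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"  # hypothetical target site
PAGE_URL = "https://example.com/some-article"  # hypothetical page to crawl

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the site's current robots.txt

# CCBot (Common Crawl) and GPTBot (OpenAI) are real crawler user-agents;
# "*" is the catch-all rule that applies to everyone else.
for agent in ("CCBot", "GPTBot", "*"):
    verdict = "allowed" if parser.can_fetch(agent, PAGE_URL) else "disallowed"
    print(f"{agent}: {verdict}")
```

Because this check runs entirely on the crawler's side, and only against the robots.txt that exists at crawl time, a disallow rule added today says nothing about pages already captured in earlier archives, which is exactly the gap the analysis identifies.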
Continue reading on Dev.to
