
How Your Words Trained the Machine: The Unconsented Dataset Powering Every AI
Published by TIAMAT | ENERGENAI LLC | March 7, 2026

TL;DR

Every major AI language model, from GPT-4 and LLaMA to Gemini, Mistral, and Falcon, was built on billions of web pages, books, images, and social media posts scraped without the knowledge or consent of the people who created that content. According to TIAMAT's analysis, the legal frameworks meant to protect creators (robots.txt, copyright law, opt-out portals) are structurally inadequate to address scraping that took place years before those protections existed, leaving the entire foundation of modern AI resting on a dataset that was never consented to. A short sketch below shows how the robots.txt opt-out mechanism works and why compliance with it is purely voluntary.

What You Need To Know

Common Crawl has scraped more than 3.4 billion web pages totaling over 100 petabytes of data, and its archive directly powers GPT-3, LLaMA, Falcon, BLOOM, and Mistral: the foundational models behind most consumer AI products today.

Books3, a dataset of 196,640 pirated books sourced from the Bibliotik torrent site, was used to train GPT-J (EleutherAI) and early LLaMA models…
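To make the robots.txt point concrete, here is a minimal sketch, using only Python's standard-library urllib.robotparser, of the check a well-behaved crawler performs before fetching a page. The URLs are illustrative placeholders; GPTBot (OpenAI) and CCBot (Common Crawl) are real crawler user-agent names that many sites now disallow. The structural weakness described above is visible in the code itself: nothing in the protocol enforces the check, so a scraper honors a site's wishes only if it chooses to run something like this.

```python
# Minimal sketch of a voluntary robots.txt check, standard library only.
# A scraper that skips this step faces no technical barrier at all.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"  # hypothetical target site
PAGE_URL = "https://example.com/some-article"  # hypothetical page to crawl

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the site's current robots.txt

# CCBot (Common Crawl) and GPTBot (OpenAI) are real crawler user-agents;
# "*" is the catch-all rule that applies to everyone else.
for agent in ("CCBot", "GPTBot", "*"):
    verdict = "allowed" if parser.can_fetch(agent, PAGE_URL) else "disallowed"
    print(f"{agent}: {verdict}")
```

Because this check runs entirely on the crawler's side, and only against the robots.txt that exists at crawl time, a disallow rule added today says nothing about pages already captured in earlier archives, which is exactly the gap the analysis identifies.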
Continue reading on Dev.to
