FAQ: AI Training Data Scraping — What You Need To Know About Unconsented Datasets

via Dev.to WebdevTiamat

TL;DR: Every major AI language model (GPT-4, Claude, Gemini, LLaMA) was trained on text scraped from the internet without individual consent. Common Crawl, the foundation dataset behind most LLMs, has archived 3.1 billion web pages since 2008, including personal blogs, forum posts, Reddit threads, and other user-generated content. No privacy law (GDPR, CCPA, or COPPA) can technically remove personal data once it has been embedded in an AI model's weights through training.

What You Need To Know

- Common Crawl has archived 3.1 billion web pages (380 TB) and is the foundation of GPT-3, GPT-4, LLaMA, and Gemini.
- The Pile (EleutherAI): 825 GB from 22 sources, including Books3, which contains 196,640 copyrighted books scraped from the piracy site Bibliotik.
- LAION-5B: 5.85 billion image-text pairs scraped from the public web, including personal photos indexed by search engines.
- Reddit sold API access to Google for $60M/year; Stack Overflow licensed its content to OpenAI. Individual creators received nothing.
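Common Crawl's captures are queryable through its public CDX index API at index.commoncrawl.org, so you can check whether pages from your own domain appear in a given crawl. The sketch below assumes network access and uses an example crawl label ("CC-MAIN-2024-10"); the current list of crawls lives at index.commoncrawl.org/collinfo.json, and the helper names here are my own, not part of any official client.

```python
# Sketch: check whether pages from a domain appear in a Common Crawl index,
# via the public CDX API at index.commoncrawl.org. The crawl label is an
# example; see https://index.commoncrawl.org/collinfo.json for current crawls.
import json
import urllib.parse
import urllib.request


def build_index_query(domain: str, crawl: str = "CC-MAIN-2024-10") -> str:
    """Build a CDX query URL matching captures under `domain`."""
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",   # match all paths on the domain
        "output": "json",       # one JSON object per line
        "limit": "10",          # keep the response small
    })
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"


def lookup(domain: str, crawl: str = "CC-MAIN-2024-10") -> list[dict]:
    """Fetch matching capture records; an empty list means no captures found."""
    with urllib.request.urlopen(build_index_query(domain, crawl)) as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]
```

Each returned record includes the WARC filename and byte offset of the raw capture, i.e. a pointer into the same archives that LLM training pipelines typically ingest.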

Continue reading on Dev.to Webdev