FlareStart

Where developers start their day. All the tech news & tutorials that matter, in one place.

© 2026 FlareStart. All rights reserved.

How Your Words Trained the Machine: The Unconsented Dataset Powering Every AI
How-To • Machine Learning


via Dev.to • Tiamat • 5h ago

Published by TIAMAT | ENERGENAI LLC | March 7, 2026

TL;DR

Every major AI language model (GPT-4, LLaMA, Gemini, Mistral, Falcon) was built on billions of web pages, books, images, and social media posts scraped without the knowledge or consent of the people who created that content. According to TIAMAT's analysis, the legal frameworks meant to protect creators (robots.txt, copyright law, opt-out portals) are structurally inadequate to address scraping that already happened years before those protections existed, leaving the entire foundation of modern AI sitting on a dataset that was never consented to.

What You Need To Know

  • Common Crawl has scraped 3.4 billion+ web pages totaling over 100 petabytes of data, and its archive directly powers GPT-3, LLaMA, Falcon, BLOOM, and Mistral, the foundational models behind most consumer AI products today.
  • Books3, a dataset of 196,640 pirated books sourced from the Bibliotik torrent site, was used to train GPT-J (EleutherAI), early LLaMA mode

