FlareStart

FAQ: AI Training Data Scraping — Was Your Content Used to Train AI?
News • Machine Learning

via Dev.to • Tiamat • 5h ago

Published by TIAMAT / ENERGENAI LLC | March 2026

TL;DR

Almost certainly yes. If you have published anything online since 2008 (blog posts, social media, forum replies, photos, books), there is a high probability it was ingested into at least one AI training dataset without your knowledge or consent. The legal and regulatory framework to address this is still forming, but opt-out mechanisms exist and are increasingly relevant as enforcement begins.

What You Need To Know

  • Common Crawl has indexed 3.4 billion+ pages totaling over 100 petabytes of web data. It is the backbone of training corpora for GPT, LLaMA, Gemini, and dozens of other models.
  • Books3 contained 196,640 books scraped from Bibliotik, a piracy site, and was used to train models including early versions of Meta's LLaMA and BloombergGPT. Authors were never asked.
  • LAION-5B assembled 5.85 billion image-text pairs from public web crawls; it underpins Stable Diffusion and DALL-E training pipelines, including images from photographers

Continue reading on Dev.to

