
Every Prompt You've Ever Typed May Be Training an AI Model — Without Your Consent
In 2020, OpenAI released GPT-3. To train it, they used a filtered version of Common Crawl alongside WebText2, two book corpora, and English Wikipedia: massive bodies of internet text scraped without consent from forums, books, news sites, and hundreds of other sources. (The Pile, an open corpus assembled by EleutherAI around the same time, drew on Reddit, GitHub, and many of the same sources.) Embedded in those corpora: names, email addresses, phone numbers, private forum conversations, medical questions, financial disclosures, domestic abuse survivor stories, and the intimate details of millions of people's lives. None of them were asked.

This is how modern AI is built. And it's still happening, at scale, right now.

The Foundation of Modern AI Is Unconsented Human Data

Large language models are trained on text. Enormous amounts of it. GPT-4 was reportedly trained on an estimated 13 trillion tokens. Claude, Gemini, Llama — all were trained on similar-scale datasets derived primarily from one source: the internet.

The internet is not a public commons. It is made up of billions of individual acts of writing — forum posts, emails that got leaked, product reviews, medical forum
Continue reading on Dev.to


