Every major AI company trains on data from the internet. Your public social media posts, your Stack Overflow answers, your Reddit comments, your product reviews, your blog posts — they are almost certainly in AI training datasets. In many cases, your private data is also being used — and you agreed to it in terms of service you did not read.
What AI Companies Admitted They Trained On
OpenAI: GPT models trained on Common Crawl (massive web scrape), books (Books1, Books2 datasets), GitHub code, Wikipedia, and more. In 2023, OpenAI confirmed they trained on personal data scraped from the web. Italian regulators temporarily banned ChatGPT over GDPR concerns about training data collection.
Google DeepMind: Gemini trained on "a multilingual and multimodal dataset including web documents, books, and code." In 2023, Google updated its privacy policy to explicitly state it could use public Google Docs, Google Maps reviews, and Google Search data to train AI — causing significant backlash.
Meta: Llama models trained on data including Facebook and Instagram posts. In 2024, Meta announced it would use European users' social media posts to train AI unless users opted out — the Irish Data Protection Commission intervened.
What You Sent to ChatGPT That Gets Used
If you use ChatGPT's free tier (or had chat history enabled in the past), your conversations may be used to improve OpenAI's models unless you specifically opted out. The same applies to many other AI tools. Every time you typed your business strategy into ChatGPT, described your health symptoms, shared personal relationship problems, or pasted confidential client information, that conversation potentially became training data.
How to Actually Opt Out and Protect Your Data
- ChatGPT: Settings → Data Controls → Improve the model for everyone → Turn OFF. This stops your conversations from being used for training.
- Google Gemini: My Activity → Other Google Activity → Gemini Apps Activity → Turn off. Also consider pausing activity saving.
- Claude.ai: Privacy settings offer conversation history and training opt-outs; check your account's current settings, since defaults and menu locations change over time.
- For maximum privacy: Run AI locally (Gemma 3, Llama 4 on your own hardware) — your data never leaves your device.
- For enterprise: ChatGPT Team/Enterprise and Claude Enterprise do not use data for training — read the enterprise data agreements carefully.
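As a sketch of what "run AI locally" means in practice: tools like Ollama expose an HTTP endpoint on your own machine (commonly http://localhost:11434/api/generate), so prompts never traverse the public internet. The snippet below builds a request payload in the shape Ollama's generate API expects; treat the endpoint URL and the model name ("llama3") as assumptions that depend on your particular install.

```python
import json

# Assumption: endpoint and field names follow Ollama's /api/generate API;
# "llama3" stands in for whatever model you have pulled locally.
LOCAL_ENDPOINT = "http://localhost:11434/api/generate"

def build_local_request(prompt: str, model: str = "llama3") -> dict:
    """Return the JSON body for a single non-streaming local completion."""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_local_request("Summarize this quarter's sales notes.")
# Sending this with any HTTP client keeps the prompt on your machine:
#   requests.post(LOCAL_ENDPOINT, json=payload)
print(json.dumps(payload))
```

The key point is architectural, not the specific tool: because the server runs on localhost, nothing in the prompt (business strategy, health details, client data) ever reaches a third party's training pipeline.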
The GDPR Protection Europeans Have That Others Do Not
EU citizens have significantly stronger AI data rights under GDPR: the right to know what data is held, the right to deletion, the right to object to processing for AI training, and real enforcement with fines up to 4% of global revenue. The Irish DPC has intervened multiple times against US AI companies using European data for training without proper legal basis. If you are in the EU: exercise these rights actively via each AI company's data request portal.