How to Stop Your Data From Being Used to Train AI

Anything you’ve ever posted online—a cringey tweet, an ancient blog post, an enthusiastic restaurant review, or a blurry Instagram selfie—has almost assuredly been gobbled up and used as part of the training materials for the current bombardment of generative AI.

Large language model tools, like ChatGPT, and image creators are powered by vast reams of our data. And even if it’s not powering a chatbot or some other generative tool, the data you have fed into the internet's many servers may be used for machine-learning features.

Tech companies have scraped vast swathes of the web to gather the data they claim is needed to create generative AI—often with little regard for content creators, copyright laws, or privacy. On top of this, increasingly, firms with reams of people’s posts are looking to get in on the AI gold rush by selling or licensing that information. Looking at you, Reddit.

However, as the lawsuits and investigations around generative AI and its opaque data practices pile up, there have been small moves to give people more control over what happens to what they post online. Some companies now let individuals and business customers opt out of having their content used in AI training or being sold for training purposes. Here’s what you can—and can’t—do.

Update: This guide was updated in October 2024. We added new websites and services to the list below and refreshed some directions that had become outdated. We will continue to update this article as the tools and their policies evolve.

There’s a Limit

Before we get to how you can opt out, it’s worth setting some expectations. Many companies building AI have already scraped the web, so anything you’ve posted is probably already in their systems. AI companies also tend to be secretive about what they have actually scraped, purchased, or used to train their systems. “We honestly don't know that much,” says Niloofar Mireshghallah, a researcher who focuses on AI privacy at the University of Washington. “In general, everything is very black-box.”

Mireshghallah explains that companies can make it complicated to opt out of having data used for AI training, and even where it is possible, many people don’t have a “clear idea” about the permissions they’ve agreed to or how data is being used. That’s before various laws, such as copyright protections and Europe’s strong privacy laws, are taken into consideration. Facebook, Google, X, and other companies have written into their privacy policies that they may use your data to train AI.

While there are various technical ways AI systems could have data removed from them or “unlearn,” Mireshghallah says, there’s very little that’s known about the processes that are in place. The options can be buried or labor-intensive. Getting posts removed from AI training data is likely to be an uphill battle. Where companies are starting to allow opt-outs for future scraping or data sharing, they are almost always opting users in by default.

“Most companies add the friction because they know that people aren’t going to go looking for it,” says Thorin Klosowski, a security and privacy activist at the Electronic Frontier Foundation. “Opt-in would be a purposeful action, as opposed to opting out, where you have to know it’s there.”

While less common, some companies building AI tools and machine-learning models don't automatically opt customers in. “We do not train our models on user-submitted data by default. We may use user prompts and outputs to train Claude where the user gives us express permission to do so, such as clicking a thumbs up or down signal on a specific Claude output to provide us feedback,” says Jennifer Martinez, a spokesperson for Anthropic. In this case, the most recent iteration of the company’s Claude chatbot is built on public information online and third-party data (content people posted elsewhere online), but not user information.

The majority of this guide deals with opt-outs for text, but artists have also been using “Have I Been Trained?” to signal that their images shouldn't be used for training. Run by startup Spawning, the service allows people to see if their creations have been scraped and then opt out of any future training. “Anything with a URL can be opted out. Our search engine only searches images, but our browser extension lets you opt out any media type,” says Jordan Meyer, cofounder and CEO of Spawning. Stability AI, the startup behind a text-to-image tool called Stable Diffusion, is among companies that have previously said they are honoring the system.

The list below only includes companies that currently have opt-out processes. For example, Meta doesn’t offer that as an option. “While we don’t currently have an opt-out feature, we’ve built in-platform tools that allow people to delete their personal information from chats with Meta AI across our apps,” says Emil Vazquez, a spokesperson for Meta. See the full steps for that process here.

Microsoft has also announced a new opt-out process for generative AI training in Copilot that may be released soon. “A portion of the total number of user prompts in Copilot and Copilot Pro responses are used to fine-tune the experience,” says Donny Turnbaugh, a spokesperson for the company. “Microsoft takes steps to de-identify data before it is used, helping to protect consumer identity.” Even if the data is de-identified (meaning inputted data is scrubbed of any information that could be used to identify you as the source), privacy-minded users may want more control over their information and can choose to opt out when that becomes an available choice.

How to Opt Out of AI Training

Adobe

If you store your files in Adobe’s Creative Cloud, the company may analyze them to improve its software. This doesn’t apply to any files stored only on your device. Also, Adobe won’t use the files to train a generative AI model, with one exception. “We do not analyze your content to train generative AI models, unless you choose to submit content to the Adobe Stock marketplace,” reads the company’s updated FAQ page.

If you’re using a personal Adobe account, it’s easy to opt out of the content analysis. Open up Adobe’s privacy page, scroll down to the Content analysis for product improvement section, and click the toggle off. If you have a business or school account, you are automatically opted out.

Amazon: AWS

AI services from Amazon Web Services, like Amazon Rekognition or Amazon CodeWhisperer, may use customer data to improve the company’s tools, but it’s possible to opt out of the AI training. This used to be one of the most complicated processes on the list, but it’s been streamlined in recent months. A support page from Amazon outlines the full process for opting out your organization.

Figma

Figma, a popular design tool, may use your data for model training. If your account is licensed through an Organization or Enterprise plan, you are automatically opted out. Starter and Professional accounts, on the other hand, are opted in by default. This setting can be changed at the team level by opening the team settings, going to the AI tab, and switching off Content training.

Google Gemini

For users of Google’s chatbot, Gemini, conversations may sometimes be selected for human review to improve the AI model. Opting out is simple, though. Open up Gemini in your browser, click on Activity, and select the Turn Off drop-down menu. Here you can just turn off the Gemini Apps Activity, or you can opt out as well as delete your conversation data. While this does mean that, in most cases, future chats won’t be selected for human review, data that has already been selected is not erased through this process. According to Google’s privacy hub for Gemini, these chats may stick around for three years.

Grammarly

Grammarly updated its policies, so personal accounts can now opt out of AI training. Do this by going to Account, then Settings, and turning the Product Improvement and Training toggle off. If your account is through an enterprise or education license, you are automatically opted out.

Grok AI (X)

Kate O'Flaherty wrote a great piece for WIRED about Grok AI and protecting your privacy on X, the platform where the chatbot operates. It’s another situation where millions of users of a website woke up one day and were automatically opted in to AI training with minimal notice. If you still have an X account, it’s possible to opt out of your data being used to train Grok by going to the Settings and privacy section, then Privacy and safety. Open the Grok tab, then deselect your data sharing option.

HubSpot

HubSpot, a popular marketing and sales software platform, automatically uses data from customers to improve its machine-learning model. Unfortunately, there’s not a button to press to turn off the use of data for AI training. You have to send an email to [email protected] with a message requesting that the data associated with your account be opted out.

LinkedIn

Users of the career networking website were surprised to learn in September that their data was potentially being used to train AI models. “At the end of the day, people want that edge in their careers, and what our gen-AI services do is help give them that assist,” says Eleanor Crum, a spokesperson for LinkedIn.

You can opt out of having new LinkedIn posts used for AI training by visiting your profile and opening the Settings. Tap on Data Privacy and switch off the toggle labeled Use my data for training content creation AI models.

OpenAI: ChatGPT and Dall-E

People reveal all sorts of personal information while using a chatbot. OpenAI provides some options for what happens to what you say to ChatGPT, including preventing its future AI models from being trained on the content. “We give users a number of easily accessible ways to control their data, including self-service tools to access, export, and delete personal information through ChatGPT. That includes easily accessible options to opt out from the use of their content to train models,” says Taya Christianson, an OpenAI spokesperson. (The options vary slightly depending on your account type, and data from enterprise customers is not used to train models.)

On its help pages, OpenAI says ChatGPT web users who want to opt out should navigate to Settings, Data Controls, and then uncheck Improve the model for everyone. OpenAI is about a lot more than ChatGPT. For its Dall-E 3 image generator, the startup has a form that allows you to send images to be removed from “future training datasets.” It asks for your name, email, whether you own the image rights or are getting in touch on behalf of a company, details of the image, and any uploads of the image(s).

OpenAI also says if you have a “high volume” of images hosted online that you want removed from training data, then it may be “more efficient” to add GPTBot to the robots.txt file of the website where the images are hosted.

Traditionally a website’s robots.txt file—a simple text file that usually sits at websitename.com/robots.txt—has been used to tell search engines, and others, whether they can include your pages in their results. It can now also be used to tell AI crawlers not to scrape what you have published—and AI companies have said they’ll honor this arrangement.
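To make the robots.txt approach concrete, here is a minimal Python sketch of how a rule aimed at GPTBot is read. It uses the standard library's urllib.robotparser rather than any AI company's actual crawler logic, and the example.com URL, the SomeSearchBot name, and the is_allowed helper are hypothetical stand-ins for illustration. Whether a real crawler obeys the rule is up to the company running it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. "GPTBot" is the crawler name OpenAI asks
# site owners to list; "Disallow: /" places the whole site off limits to it.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL, not a real image.
    url = "https://example.com/gallery/photo.jpg"
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    # A crawler that is not named in the file keeps its access.
    print("SomeSearchBot allowed:", is_allowed(ROBOTS_TXT, "SomeSearchBot", url))
```

Under these rules the check reports that GPTBot is blocked from every path, while a crawler the file doesn't mention is unaffected.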

Perplexity

Perplexity is a startup that uses AI to help you search the web and find answers to questions. Like other software on this list, you are automatically opted in to having your interactions and data used to train Perplexity’s AI further. Turn this off by clicking on your account name, scrolling down to the Account section, and turning off the AI Data Retention toggle.

Quora

Quora says it “currently” doesn’t use answers to people’s questions, posts, or comments for training AI. It also hasn’t sold any user data for AI training, a spokesperson says. However, it does offer an opt-out in case this changes in the future. To use it, visit the Settings page, click Privacy, and turn off the “Allow large language models to be trained on your content” option. Users are opted in to the setting by default. Even with the setting turned off, some Quora posts may still be used for training LLMs. If you reply to a machine-generated answer, the company’s help pages say, then those answers may be used for AI training. Quora also points out that third parties may just scrape its content anyway.

Rev

Rev, a voice transcription service that uses both human freelancers and AI to transcribe audio, says it uses data “perpetually” and “anonymously” to train its AI systems. Even if you delete your account, it will still train its AI on that information.

Kendell Kelton, head of brand and corporate communications at Rev, says it has the “largest and most diverse dataset of voices,” made up of more than 7 million hours of voice recording. Kelton says Rev does not sell user data to any third parties. The firm’s terms of service say data will be used for training, and that customers are able to opt out. People can opt out of their data being used by sending an email to [email protected], its help pages say.

Slack

All of those random Slack messages at work might be used by the company to train its models as well. “Slack has used machine learning in its product for many years. This includes platform-level machine-learning models for things like channel and emoji recommendations,” says Jackie Rocca, a vice president of product at Slack who’s focused on AI.

Even though the company does not use customer data to train a large language model for its Slack AI product, Slack may use your interactions to improve the software’s machine-learning capabilities. This could include information like your messages, content, and files, says Slack’s privacy page.

The only real way to opt out is to have your administrator email Slack at [email protected]. The message must have the subject line “Slack Global model opt-out request” and include your organization's URL. Slack doesn’t provide a timeline for how long the opt-out process takes, but it should send you a confirmation email after it’s complete.

Squarespace

Website-building tool Squarespace has built in a toggle to stop AI crawlers from scraping websites it hosts. This works by updating your website’s robots.txt file to tell AI companies the content is off limits. To block the AI bots, open Settings within your account, find Crawlers, and select Block known artificial intelligence crawlers. It points out this should work for the following crawlers: Anthropic AI, Applebot-Extended, CCBot, Claude-Web, cohere-ai, FacebookBot, Google Extended, GPTBot and ChatGPT-User, and PerplexityBot.

Substack

If you use Substack for blog posts, newsletters, or more, the company also has an easy option to apply the robots.txt opt-out. Within your Settings page, go to the Publication section and turn on the toggle to Block AI training. Its help page points out: “This will only apply to AI tools that respect this setting.”

Tumblr

Blogging and publishing platform Tumblr—owned by Automattic, which also owns WordPress—says it is “working with” AI companies that are “interested in the very large and unique set of publicly published content” on the wider company’s platforms. This doesn’t include user emails or private content, an Automattic spokesperson says.

Tumblr has a “prevent third-party sharing” option to stop what you publish being used for AI training, as well as being shared with other third parties such as researchers. If you’re using the Tumblr app, go to account Settings, select your blog, click on the gear icon, select Visibility, and toggle the “Prevent third-party sharing” option. Explicit posts, deleted blogs, and those that are password-protected or private, are not shared with third-party companies in any case, Tumblr’s support pages say.

WordPress

Like Tumblr, WordPress has a “prevent third-party sharing” option. To turn it on, visit your website’s dashboard, click Settings, then General, then Privacy, and select the Prevent third-party sharing box. “We are also trying to work with crawlers (like commoncrawl.org) to prevent content from being scraped and sold without giving our users choice or control over how their content is used,” an Automattic spokesperson says.

Your Website

If you are hosting your own website, you can update your robots.txt file to tell AI bots not to scrape the pages. Most news websites don’t allow their articles to be crawled by AI bots. WIRED’s robots.txt file, for example, doesn’t allow crawling by bots from Google, Amazon, Facebook, Anthropic, or Perplexity, among others. This opt-out isn’t just for publishers though: Any website, big or small, can alter its robots file to exclude AI crawlers. All you need to do is add a disallow command; working examples can be found here.
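As a rough illustration of adding that disallow command, the Python sketch below appends one rule group covering several AI crawlers to an existing robots.txt and then checks the result with the standard library's parser. The crawler names are assumed user-agent spellings for the bots listed in the Squarespace section above, and the add_ai_disallow_rules helper and sample file contents are hypothetical. Check each company's documentation for the exact names before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the crawlers named in the Squarespace section.
# The exact spellings are an assumption; confirm them with each company.
AI_CRAWLERS = [
    "anthropic-ai", "Applebot-Extended", "CCBot", "Claude-Web", "cohere-ai",
    "FacebookBot", "Google-Extended", "GPTBot", "ChatGPT-User", "PerplexityBot",
]


def add_ai_disallow_rules(existing: str, crawlers: list[str]) -> str:
    """Return robots.txt text with one 'Disallow: /' group for all crawlers.

    The existing rules are kept as they are; the new group goes at the end.
    """
    group = [f"User-agent: {name}" for name in crawlers] + ["Disallow: /"]
    base = existing.rstrip("\n")
    # A blank line separates the old rules from the new group.
    parts = ([base, ""] if base else []) + group
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    # Hypothetical starting file that allows every crawler.
    current = "User-agent: *\nAllow: /\n"
    updated = add_ai_disallow_rules(current, AI_CRAWLERS)
    print(updated)

    # Sanity check with the standard-library parser (placeholder URL).
    parser = RobotFileParser()
    parser.parse(updated.splitlines())
    for name in AI_CRAWLERS + ["Googlebot"]:
        print(name, parser.can_fetch(name, "https://example.com/post"))
```

The sketch only appends rules. If your file already has a group for one of these crawlers, edit that group by hand instead, because parsers can differ in how they resolve conflicting entries.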
