Generative AI systems are trained largely on content that automated crawlers scrape from the web. Apple allows publishers to opt out of having their content used this way, and a new report says that many of the biggest websites have specifically opted out of Apple Intelligence training.
This includes both Facebook and Instagram, as well as many high-profile news and media sites like The New York Times and The Atlantic …
Apple’s AI training
Large language models like those behind ChatGPT are trained on billions of words of source material, ranging from news stories to user comments.
In Apple’s case, the company has for years been using its web crawler, Applebot, to power Siri and surface Spotlight suggestions. More recently, the company has also been using content collected by Applebot to train Apple Intelligence.
The practice is controversial, as AIs are effectively using copyrighted material to generate their own versions of it. For more niche topics, where source material is scarce, they have even been found to regurgitate entire paragraphs with almost no changes made.
But Apple does this in an ethical way, allowing publishers to opt out and screening out personal data (though it did get caught out by one third-party source). The company explains:
We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control […]
We apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet.
Apple uses a separate Applebot-Extended user agent, which sites can disallow in their robots.txt file to opt out of AI training while still allowing search indexing by the regular Applebot – meaning that their pieces can still be included in Spotlight and Siri searches.
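As a rough illustration of how this opt-out can look in practice, the Python sketch below parses a hypothetical robots.txt and checks the rules for each of the two Apple user agents. The sample file, the example.com URL, and the is_allowed helper are assumptions made up for illustration, not taken from any real site or from Apple's documentation; the parsing uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks Applebot-Extended (AI training opt-out)
# while leaving every other crawler, including Applebot, unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, agent: str,
               url: str = "https://example.com/article") -> bool:
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Applebot", "Applebot-Extended"):
        status = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent) else "blocked"
        print(f"{agent}: {status}")
```

With this sample file, Applebot is reported as allowed and Applebot-Extended as blocked, which is the combination described above: search indexing continues while AI training use is refused. Real robots.txt files vary, so a check like this is only a first approximation of what a given publisher intends.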
Many big web publishers opting out
Since opting out is done using a publicly accessible robots.txt file, it’s easy to see which sites have done this. WIRED checked a number of the biggest news and social media sites.
WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training […]
In a separate analysis conducted this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed (294 of 1,167 primarily English-language, US-based publications) are blocking Applebot-Extended.
Applebot-Extended is a relatively new tag, so it’s likely that more websites will also opt out once awareness increases.
Money is of course one factor
Apple is believed to have struck deals with some media companies, paying a fee in return for the right to use their content for training. It’s likely this is the motivation for at least some sites currently blocking Apple – holding out for a payment offer.
“A lot of the largest publishers in the world are clearly taking a strategic approach,” says Originality AI founder Jon Gillham. “I think in some cases, there’s a business strategy involved—like, withholding the data until a partnership agreement is in place.”
iOS 18.1 beta 3 includes several new Apple Intelligence features, including Photo Clean Up and more notification summaries.