In response to the suits, defendants such as Meta, OpenAI, and Bloomberg have argued that their actions constitute fair use. A case against EleutherAI, which originally scraped the books and made them public, was voluntarily dismissed by the plaintiffs.
Litigation in the remaining cases is still in its early stages, leaving the questions surrounding permission and payment unresolved. The Pile has since been removed from its official download site, but it’s still available on file-sharing services.
“Technology companies have run roughshod,” said Amy Keller, a consumer protection attorney and partner at the firm DiCello Levitt who has brought lawsuits on behalf of creatives whose work was allegedly scooped up by AI firms without their consent.
“People are concerned about the fact that they didn’t have a choice in the matter,” Keller said. “I think that’s what’s really problematic.”
Parroting a Parrot
Many creators feel uncertain about the path ahead.
Full-time YouTubers patrol for unauthorized use of their work, regularly filing takedown notices, and some worry it’s only a matter of time before AI can generate content similar to what they make—if not produce outright copycats.
Pakman, the creator of The David Pakman Show, saw the power of AI recently while scrolling on TikTok. He came across a video that was labeled as a Tucker Carlson clip, but when Pakman watched it, he was taken aback. It sounded like Carlson but was, word for word, what Pakman had said on his YouTube show, down to the cadence. He was equally alarmed that only one of the video’s commenters seemed to recognize that it was fake—a voice clone of Carlson reading Pakman’s script.
“This is going to be a problem,” Pakman said in a YouTube video he made about the fake. “You can do this essentially with anybody.”
EleutherAI cofounder Sid Black wrote on GitHub that he created YouTube Subtitles using a script that downloads the subtitles from YouTube’s API in the same way a YouTube viewer’s browser downloads them when watching a video. According to documentation on GitHub, Black used 495 search terms to cull videos, including “funny vloggers,” “Einstein,” “black protestant,” “Protective Social Services,” “infowars,” “quantum chromodynamics,” “Ben Shapiro,” “Uighurs,” “fruitarian,” “cake recipe,” “Nazca lines,” and “flat earth.”
Though YouTube’s terms of service prohibit accessing its videos by “automated means,” more than 2,000 GitHub users have bookmarked or endorsed the code.
“There are many ways in which YouTube could prevent this module from working if that was what they are after,” wrote machine learning engineer Jonas Depoix in a discussion on GitHub, where he published the code Black used to access YouTube subtitles. “This hasn’t happened so far.”
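To make the mechanism concrete: when a viewer turns on captions, YouTube’s servers return a simple XML “timedtext” document of timestamped caption lines, and a scraper can parse that same response. The sketch below is illustrative only, it is not the published code, and the sample XML is a made-up stand-in for a real API response.

```python
# A minimal sketch (not the actual scraper) of parsing YouTube's
# "timedtext" caption XML, the same data a viewer's browser fetches
# when captions are switched on.
import html
import xml.etree.ElementTree as ET


def parse_timedtext(xml_text):
    """Return a list of (start_seconds, duration_seconds, caption) tuples."""
    root = ET.fromstring(xml_text)
    cues = []
    for node in root.iter("text"):
        start = float(node.get("start", "0"))
        dur = float(node.get("dur", "0"))
        # Caption text arrives HTML-entity encoded, so decode it.
        caption = html.unescape(node.text or "")
        cues.append((start, dur, caption))
    return cues


# Illustrative sample, not a real API response.
sample = (
    '<transcript>'
    '<text start="0.0" dur="2.5">hello everyone</text>'
    '<text start="2.5" dur="3.1">welcome back to the show</text>'
    '</transcript>'
)

for start, dur, caption in parse_timedtext(sample):
    print(f"{start:6.2f}s  {caption}")
```

Stripping the timestamps from those cues and concatenating the caption text yields the kind of flat transcript that ends up in a training dataset.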
In an email to Proof News, Depoix said he hasn’t used the code since he wrote it as a university student for a project several years ago and was surprised people found it useful. He declined to answer questions about YouTube’s rules.
Google spokesperson Jack Malon said in an email response to a request for comment that the company has taken “action over the years to prevent abusive, unauthorized scraping.” He did not respond to questions about other companies’ use of the material as training data.
Among the videos used by AI companies are 146 from Einstein Parrot, a channel with nearly 150,000 subscribers. The African grey’s caretaker, Marcia, who didn’t want to use her last name for fear of endangering the famous bird’s safety, said at first she thought it was funny to learn AI models had ingested words of a mimicking parrot.
“Who would want to use a parrot’s voice?” Marcia said. “But then, I know that he speaks very well. He speaks in my voice. So he’s parroting me, and then AI is parroting the parrot.”
Once ingested by AI, data cannot be unlearned. Marcia was troubled by all the unknown ways in which her bird’s information could be used, including creating a digital duplicate parrot and, she worried, making it curse.
“We’re treading on uncharted territory,” Marcia said.