AI Giants Secretly Train Models on Thousands of YouTube Videos Without Creators' Consent


The Steward

17 Jul 2024

Major AI firms are secretly training their systems on subtitles from thousands of YouTube videos, bypassing creator consent and stirring debates over data ownership and ethical AI development.

As artificial intelligence (AI) increasingly shapes daily life, the ethics of how these systems are trained have come under scrutiny. A recent investigation by Proof News has revealed that some of the most prominent AI companies, including Anthropic, Nvidia, Apple, and Salesforce, have been using content from thousands of YouTube videos to train their models, often without the creators' knowledge or permission.

This practice raises significant concerns about data privacy, intellectual property rights, and the fair compensation of content creators. The investigation uncovered that subtitles from 173,536 YouTube videos, sourced from over 48,000 channels, were included in a dataset known as "YouTube Subtitles." This dataset has been used to train AI models, despite YouTube's rules against harvesting materials without explicit permission.

The content used spans a wide range of genres and creators. Educational institutions like Khan Academy, MIT, and Harvard have had their video transcripts utilized. Mainstream media outlets such as The Wall Street Journal, NPR, and the BBC are also represented, along with popular entertainment shows like "The Late Show With Stephen Colbert," "Last Week Tonight With John Oliver," and "Jimmy Kimmel Live."

However, it's not just well-known institutions that have been affected. YouTube megastars, including MrBeast (289 million subscribers, two videos taken), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken), have also had their content used without consent. Some of the material even promoted controversial theories, such as the "flat-Earth theory."

David Pakman, host of "The David Pakman Show," a left-leaning politics channel with over two million subscribers and more than two billion views, expressed his frustration. Nearly 160 of his videos were included in the YouTube Subtitles training dataset. "No one came to me and said, 'We would like to use this,'" Pakman said. He runs a full-time enterprise that produces multiple videos daily, along with a podcast and content for other platforms. If AI companies are profiting from these models, Pakman believes he and other creators should be compensated.

The ethical concerns extend beyond financial compensation to a broader issue of transparency and trust. Content creators invest significant time and resources in producing their work, often without any expectation that it will be used to train AI systems. When companies neither communicate clearly nor seek consent, they undermine the trust between content creators and technology firms.

To address this issue, Proof News has created a tool that allows creators to search for their content within the YouTube Subtitles training dataset. The tool can help creators determine whether their work has been used and, if so, take steps to seek compensation or demand better practices from AI companies.

The use of YouTube content without permission highlights a growing tension between the rapid development of AI technologies and the ethical considerations that should guide their creation. As AI continues to evolve, it is crucial for both companies and regulators to prioritize transparency, fairness, and the protection of creators' rights.