An investigation revealed that some of the world's richest tech companies have used content from thousands of YouTube videos to train their AI systems. They did so despite YouTube's rules against using material from the platform without permission.
An investigation by Proof News found that some of the world’s biggest tech companies used transcripts from over 173,000 YouTube videos to train their AI models without permission. The dataset, made by the nonprofit EleutherAI, includes transcripts from more than 48,000 YouTube channels. Companies like Apple, NVIDIA, and Anthropic used this data.
This issue highlights a big problem: AI technology often relies on data taken from creators without their consent or payment. The dataset doesn’t have actual videos or images, but it includes transcripts from popular creators like Marques Brownlee and MrBeast, as well as major news outlets like The New York Times, BBC, and ABC News. Transcripts from Engadget videos are also in the dataset.
Marques Brownlee, also known as MKBHD, shared his concerns on X (formerly Twitter). He said, "Apple got data for their AI from several companies. One of them took lots of data/transcripts from YouTube videos, including mine. This will be a problem for a long time."
A Google spokesperson told Engadget that using YouTube data to train AI models breaks the platform’s rules. This is consistent with what YouTube CEO Neal Mohan said earlier. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to Engadget’s request for comment.
Tech companies are not always clear about where they get their training data. Earlier this month, artists and photographers criticized Apple for not revealing the sources of data for Apple Intelligence, its new AI system coming to millions of devices this year.
YouTube is the world’s largest video site, full of transcripts, audio, video, and images, making it a valuable resource for AI training. Earlier this year, OpenAI’s CTO, Mira Murati, avoided questions from The Wall Street Journal about using YouTube videos to train Sora, OpenAI’s upcoming AI video tool. Murati said the data was "publicly available or licensed." Alphabet CEO Sundar Pichai also said using YouTube data for AI training breaks the platform’s rules.
To see if transcripts from your YouTube videos or your favorite channels are in the dataset, visit Proof News' lookup tool.