A recent study from the Data Provenance Initiative revealed that up to 45% of the data in C4, a prominent training data set, is restricted by websites’ terms of service. Shayne Longpre, the study’s lead author, highlighted a growing reluctance across the web to consent to data usage, a shift that could affect AI companies, researchers, academics, and noncommercial entities alike. Data is crucial for generative AI systems like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, which rely on billions of text, image, and video examples drawn from public websites.
Generative AI models improve with high-quality data, but the recent AI boom has put developers in conflict with data owners wary of their content being used without compensation. The backlash has taken the form of paywalls, revised terms of service, and blocked web crawlers. Sites such as Reddit and Stack Overflow now charge AI companies for data access, and legal actions have multiplied, most prominently The New York Times’ copyright-infringement lawsuit against OpenAI and Microsoft.
In response, AI companies have intensified their efforts to secure data, including transcribing YouTube videos and adjusting their own data policies. More recently, some AI firms have negotiated deals with publishers such as The Associated Press and News Corp to gain ongoing access to their content, reflecting a shift toward formal agreements that ensure data availability for AI training.