arxiv:2407.14933

Consent in Crisis: The Rapid Decline of the AI Data Commons

Published on Jul 20

Submitted by mmhamdy on Jul 23

Authors: Robert Mahari, Ariel Lee, Ahmad Anis, Damien Sileo, Deividas Mataciunas

Abstract

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.
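As a rough illustration of the kind of robots.txt check such an audit relies on, the sketch below parses a robots.txt file and reports which crawler user agents may fetch a given URL. It is a minimal sketch, not the authors' pipeline: the sample robots.txt, the agent list, the probe URLs, and the helper name `classify_agents` are all assumptions made for illustration. It uses only Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any audited domain.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""

# Illustrative agent list (an assumption, not the list used in the paper).
AGENTS = ["GPTBot", "CCBot", "Googlebot"]


def classify_agents(robots_txt, agents, probe_url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch probe_url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, probe_url) for agent in agents}


restrictions = classify_agents(SAMPLE_ROBOTS_TXT, AGENTS)
for agent, allowed in restrictions.items():
    print(f"{agent}: {'allowed' if allowed else 'restricted'}")
```

Running the same check against snapshots of a domain's robots.txt taken at different dates would give a simple longitudinal view of how restrictions change. Note that robots.txt says nothing about a site's Terms of Service, which the abstract reports can be inconsistent with it.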

Community

mmhamdy

Paper author Paper submitter Jul 23

edited Jul 23

@librarian-bot recommend

mmhamdy

Paper author Paper submitter Jul 23

This is the first large-scale audit of the web sources (about 14,000 websites) that provide most of the content for some of the most widely used datasets for pretraining large language models: C4, RefinedWeb, and Dolma.