Digital Content Next, a trade body representing U.S. digital publishers, has sent a cease-and-desist letter to the Common Crawl Foundation demanding that it stop scraping publisher content and remove material already in its datasets, Reuters reported on June 9, 2026.
The letter, dated June 8, 2026, alleges that Common Crawl's web crawling violates copyright law by systematically collecting and distributing copyrighted content without authorization. Common Crawl, a nonprofit that provides free web crawl data for research and AI training, had not publicly responded to the letter as of June 10, 2026.
This action follows growing tensions between content creators and AI developers over the use of scraped data for training large language models. Digital Content Next represents major publishers including The New York Times, The Wall Street Journal, and The Washington Post.
The dispute highlights the ongoing legal and ethical debate over web scraping for AI training, with publishers seeking compensation or opt-out mechanisms for their content. Common Crawl's datasets have been widely used by companies such as OpenAI and Google to train AI systems.