Major news organizations are actively blocking the Internet Archive’s Wayback Machine to prevent their content from being used to train artificial intelligence models.
An analysis by Originality.ai reveals that 23 major news sites now block 'ia_archiverbot,' the Wayback Machine's web crawler. The crackdown includes high-profile sites such as The New York Times and Reddit.
USA Today Co. has implemented blocks across its network of more than 200 media outlets. The Guardian is also restricting access, allowing crawling but filtering archived content from public view, which creates digital dead ends for researchers.
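Blocks like these are typically implemented through the Robots Exclusion Protocol, a plain-text robots.txt file at a site's root that names crawlers and the paths they may not fetch. A minimal sketch is below; the user-agent token follows the name cited in the analysis, though the exact token each publisher targets is an assumption (the Internet Archive's crawler has historically identified itself as "ia_archiver").

```text
# Hypothetical robots.txt entry asking the Wayback Machine's crawler
# to skip the entire site. The token matches the name cited in the
# Originality.ai analysis; individual publishers may target others.
User-agent: ia_archiverbot
Disallow: /
```

A `Disallow: /` rule covers every path on the site, but compliance with robots.txt is voluntary on the crawler's part, which is one reason some publishers layer additional restrictions on top of it.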
Publishers cite copyright and competition concerns
Publishers claim these measures are necessary to stop AI companies from scraping archives to build competing products. The New York Times stated that archived content is being used "to directly compete with us," though the company did not provide specific evidence of copyright violations.
USA Today Co. describes its actions as routine bot prevention. However, the move strips journalists and researchers of a primary tool for verifying historical accuracy and tracking editorial changes.
For three decades, the Wayback Machine has preserved over a trillion web pages. The current wave of blocking threatens the long-term accessibility of the public web as publishers prioritize protecting intellectual property from large language model developers.