
The leak exposes a murky world where tech giants like Anthropic tread legal and ethical lines by using data from websites without clear consent, sparking debates over copyright and data security.
A recent leak has brought to light an internal document from Surge AI, a contractor working with Anthropic, detailing the websites approved and prohibited for training Anthropic’s latest AI models. The list, which was inadvertently left public on Google Docs, highlights significant discrepancies between the content providers' permissions and Anthropic's practices.
The leak is significant for several reasons:
Despite the risks, this situation also presents opportunities for improvement:

The leaked spreadsheet categorizes websites into two lists: "Sites You Can Use" and "Not Approved." The approved list includes reputable sources such as Bloomberg, Harvard University, and the Mayo Clinic. However, at least three of these entities have denied having any AI training agreements with Anthropic.
On the other hand, the blacklist includes companies like The New York Times and Reddit, which have previously sued AI startups for scraping their content without permission. This list suggests that Surge AI's gig workers were explicitly instructed not to use these sources for training purposes.
According to a law professor cited by Business Insider, the legal implications of using whitelisted but copyrighted material may be limited in terms of fair use. However, this does not absolve Anthropic from potential copyright infringement claims if content providers decide to pursue legal action.
The leak of Anthropic's internal document highlights significant issues in data sourcing and legal compliance within the AI industry. While there are opportunities for improvement, the incident underscores the need for greater transparency and robust security measures to protect both companies and content providers.
About the author
Marcus began tracking AI's market implications in 2016, noticing AI-related patent filings accelerating ahead of earnings upgrades before most of the sell-side had caught on. A former fixed-income quantitative analyst, he spent two decades building models that priced risk across emerging markets before pivoting to cover the economic impact of AI full-time. His writing translates opaque technical developments into clear risk/reward terms — and he's rarely diplomatic about the gap between AI valuations and underlying fundamentals. He believes most market participants still underestimate AI's long-run deflationary effect on knowledge work.
24 July 2025