Leaked Document Reveals Anthropic's Whitelist and Blacklist for AI Training Data

Security & Risk

The Analyst

24 Jul 2025

The leak exposes a murky world in which tech giants like Anthropic test legal and ethical lines by using website data without clear consent, sparking debates over copyright and data security.

A recent leak has brought to light an internal document from Surge AI, a contractor working with Anthropic, detailing the websites approved and prohibited for training Anthropic’s latest AI models. The list, which was inadvertently left public on Google Docs, highlights significant discrepancies between the content providers' permissions and Anthropic's practices.

Why it Matters

The leak is significant for several reasons:

Legal Implications: It raises questions about the legality of using copyrighted material for AI training without explicit permission.
Data Security: The exposure of internal documents underscores potential security vulnerabilities in data handling processes.
Transparency and Trust: It highlights a lack of transparency from Anthropic regarding its data sources, potentially eroding trust with content providers.

Key Risks

Legal Action: At least three organizations (the Mayo Clinic, Cornell University, and Morningstar) have stated they have no AI training agreements with Anthropic. Any of them could bring a legal challenge if it decides to pursue action against the company.
Reputational Damage: The leak may damage Anthropic's reputation among content providers and the broader public, particularly if other organizations follow suit in expressing concerns.
Regulatory Scrutiny: The incident may attract attention from regulatory bodies, leading to investigations into Anthropic’s data practices.

The Opportunity

Despite the risks, this situation also presents opportunities for improvement:

Enhanced Transparency: Anthropic can take steps to be more transparent about its data sources and usage, potentially rebuilding trust with content providers.
Legal Compliance: By addressing the concerns raised by the leak, Anthropic can ensure it is in compliance with copyright laws, thereby reducing legal risks.
Data Security Improvements: The incident highlights the need for better security measures to prevent unauthorized access to sensitive documents.

Detailed Findings

The leaked spreadsheet categorizes websites into two lists: "Sites You Can Use" and "Not Approved." The approved list includes reputable sources such as Bloomberg, Harvard University, and the Mayo Clinic. However, at least three organizations on the approved list (the Mayo Clinic, Cornell University, and Morningstar) have said they have no AI training agreements with Anthropic.

On the other hand, the blacklist includes companies like The New York Times and Reddit, which have previously sued AI startups for scraping their content without permission. This list suggests that Surge AI's gig workers were explicitly instructed not to use these sources for training purposes.

Legal Perspective

According to a law professor cited by Business Insider, the legal implications of using whitelisted but copyrighted material may be limited in terms of fair use. However, this does not absolve Anthropic from potential copyright infringement claims if content providers decide to pursue legal action.

Company Responses

Anthropic: The company stated it was unaware of the list and that Surge AI created it.
Surge AI: The contractor declined to comment on the creation of the list but locked down dozens of files for the project after Business Insider reached out. Surge AI also said it is "looking closely" into the security lapse.

Conclusion

The leak of Surge AI's internal document for its Anthropic work highlights significant issues in data sourcing and legal compliance within the AI industry. While there are opportunities for improvement, the incident underscores the need for greater transparency and robust security measures to protect both companies and content providers.