
LAION unveils Re-LAION-5B, a cleaned version of its LAION-5B dataset with known links to suspected child sexual abuse material removed, marking a significant step towards safer AI research and bolstering public trust in emerging technologies.
In a world where the internet is both a vast resource and a potential source of harm, ensuring that data used to train artificial intelligence (AI) models is safe and ethical is paramount. Today, LAION e.V., a leading organization in open-source AI research, has announced the release of Re-LAION-5B, an updated version of its LAION-5B dataset. This new iteration addresses significant safety concerns, particularly the presence of links to suspected Child Sexual Abuse Material (CSAM).
The integrity and ethical use of data are crucial for building trust in AI systems. When datasets contain harmful content, it not only risks legal and moral repercussions but also undermines the potential benefits of AI research. Re-LAION-5B represents a significant step forward in ensuring that the data used to train AI models is both safe and reliable.
Re-LAION-5B is an updated version of LAION-5B, which was previously one of the largest web-scale datasets linking text to images. The original dataset faced criticism after a December 2023 report by the Stanford Internet Observatory found that it contained links to suspected CSAM. In response, LAION partnered with organizations like the Internet Watch Foundation (IWF), the Canadian Centre for Child Protection (C3P), and the Stanford Internet Observatory to thoroughly clean the dataset.
CSAM Link Removal: A total of 2,236 links were removed from the dataset after matching them with lists of known CSAM hashes provided by IWF and C3P. This includes 1,008 links identified in the Stanford report. It's important to note that many of these links are likely no longer active due to ongoing efforts to remove such content from the internet.
Dataset Size: After the removals, Re-LAION-5B still contains about 5.5 billion (5,526,641,167) pairs, each linking a text caption to an image URL.
Metadata for Third Parties: The metadata for Re-LAION-5B is available for third parties to use in cleaning their own derivatives of LAION-5B. This includes a set of diffs that can be used to remove matched content without disclosing the identity of potentially illegal material. These diffs are safe and do not significantly impact the dataset's overall size or usability.
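The article does not describe the format of these diffs, so the following is only a rough sketch of the general idea, not LAION's actual tooling. It assumes, purely for illustration, that a diff arrives as a set of opaque keys (here a SHA-256 digest over a sample's URL and caption) and that a derivative dataset is a sequence of `(url, caption)` rows. The names `row_key` and `apply_removal_diff` are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, Set, Tuple

Row = Tuple[str, str]  # (image URL, caption); an assumed row layout


def row_key(url: str, caption: str) -> str:
    """Hypothetical opaque key: SHA-256 over the URL and caption.

    The real Re-LAION-5B metadata may identify samples differently.
    """
    return hashlib.sha256(f"{url}\n{caption}".encode("utf-8")).hexdigest()


def apply_removal_diff(rows: Iterable[Row], removed_keys: Set[str]) -> Iterator[Row]:
    """Yield only the rows whose key is absent from the removal set."""
    for url, caption in rows:
        if row_key(url, caption) not in removed_keys:
            yield (url, caption)


# Toy data with placeholder URLs; nothing here comes from the real dataset.
derivative = [
    ("https://example.org/a.jpg", "a red bicycle"),
    ("https://example.org/b.jpg", "a mountain lake"),
    ("https://example.org/c.jpg", "a city skyline"),
]
removal_diff = {row_key("https://example.org/b.jpg", "a mountain lake")}

cleaned = list(apply_removal_diff(derivative, removal_diff))
```

Because the removal set holds only digests, whoever applies it can drop matching rows without being told which URLs were flagged, which is the property the article attributes to the diffs.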

Benefits:
Risks:
The release of Re-LAION-5B sets a precedent for responsible data management in AI research. It highlights the importance of collaboration between researchers, ethical organizations, and regulatory bodies to ensure that AI development is both innovative and safe. By addressing these issues proactively, LAION demonstrates a commitment to ethical practices that can inspire other organizations to follow suit.
Re-LAION-5B is more than just an updated dataset; it's a step towards a safer and more transparent future in AI research. As the field continues to evolve, ensuring the integrity of the data used to train AI models will be crucial for building trustworthy and beneficial technologies.
About the author
Amara's entry point into AI was an epidemiology role at a London research hospital, where she spent five years studying how digital health tools reached — or conspicuously failed to reach — underserved communities. Watching early algorithmic systems in healthcare quietly entrench existing inequalities, she redirected her career toward the systemic consequences of AI at scale. She covers AI through an unflinching lens: who benefits, who bears the cost, and what evidence actually says versus what the press release claims. Her writing is calm and precise, but she doesn't mistake balance for neutrality.
2 September 2024