
Pleias's Common Corpus, with over 2 trillion tokens, sets a new standard for open multilingual datasets, offering LLM developers worldwide both scale and transparency about where the data comes from.
Today, Pleias released Common Corpus, the largest fully open multilingual dataset for training large language models (LLMs). The dataset contains over 2 trillion tokens (2,003,039,184,047, to be precise) of permissively licensed content with detailed provenance information. The release is part of the AI Alliance Open Trusted Data Initiative, and the dataset is available on Hugging Face.
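To make the role of provenance information concrete, here is a minimal Python sketch of how per-record license metadata could be used to tally and filter training data. The field names (`license`, `language`, `token_count`) and the sample records are hypothetical illustrations, not the confirmed schema or contents of Common Corpus; only the total token count is taken from the release.

```python
from collections import defaultdict

# Exact token count quoted in the release.
TOTAL_TOKENS = 2_003_039_184_047

# Hypothetical records: the field names and values below are illustrative
# assumptions, not the dataset's confirmed schema or real entries.
SAMPLE_RECORDS = [
    {"license": "Public Domain", "language": "French", "token_count": 1200},
    {"license": "CC-By", "language": "English", "token_count": 800},
    {"license": "Public Domain", "language": "English", "token_count": 500},
]


def tokens_by_field(records, field):
    """Sum token_count for each distinct value of `field`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[field]] += record["token_count"]
    return dict(totals)


def filter_by_license(records, allowed):
    """Keep only the records whose license is in the allowed set."""
    return [r for r in records if r["license"] in allowed]


license_totals = tokens_by_field(SAMPLE_RECORDS, "license")
public_domain_only = filter_by_license(SAMPLE_RECORDS, {"Public Domain"})

# How far the exact count exceeds the rounded "2 trillion" headline figure.
excess_over_two_trillion = TOTAL_TOKENS - 2_000_000_000_000
```

With record-level metadata of this kind, a team can audit what share of its training mix falls under each license, or restrict training to a subset, before any tokens reach the model.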
Many have argued that training large language models requires copyrighted data, which poses significant legal and ethical challenges for open-source development. Pleias is challenging this notion by providing a massive dataset of permissively licensed content that the AI community can use freely. This not only addresses regulatory pressure but also supports the case that models trained on high-quality open data can perform comparably to models trained on copyrighted material.
The creation of Common Corpus involved several key steps:

While the primary focus of Common Corpus is on openness and compliance, it also aims to deliver high performance. Initial benchmarks show that models trained on this dataset can achieve competitive results in various NLP tasks, including:
The release of Common Corpus is a significant step towards democratizing AI development. It provides researchers, developers, and enthusiasts with a powerful tool to train and experiment with LLMs without the legal and ethical hurdles associated with copyrighted data. This can lead to more innovation and collaboration within the AI community.
Pleias's release of Common Corpus is a game-changer in the world of open-source AI. By providing a large, high-quality, and compliant dataset, they are proving that it is possible to develop powerful LLMs without relying on copyrighted material. This opens up new opportunities for research and development while addressing regulatory concerns.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
14 November 2024