Pleias Releases 2 Trillion Token Open Multilingual Dataset for LLM Training

The Engineer

14 Nov 2024

Pleias's Common Corpus, at roughly 2 trillion tokens, is a large open multilingual dataset that gives LLM developers permissively licensed training data with documented provenance.

Today, Pleias released Common Corpus, the largest fully open multilingual dataset for training large language models (LLMs). The dataset contains just over 2 trillion tokens (2,003,039,184,047, to be precise) of permissively licensed content with detailed provenance information. The release is part of the AI Alliance Open Trusted Data Initiative and is available on Hugging Face.
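For readers who want to inspect the data, the following is a minimal sketch of streaming a few records with the Hugging Face `datasets` library. The repository id `PleIAs/common_corpus` and the `train` split are assumptions not stated in this article, so confirm both on the dataset page. The helper `first_n` is a hypothetical convenience function.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Assumption: repository id and split name; verify them on the dataset page.
DATASET_ID = "PleIAs/common_corpus"
SPLIT = "train"


def stream_common_corpus(dataset_id: str = DATASET_ID) -> Iterator[Dict[str, Any]]:
    """Yield records lazily so the full corpus is never downloaded at once."""
    # Imported here so first_n() below works without the `datasets` package.
    from datasets import load_dataset

    dataset = load_dataset(dataset_id, split=SPLIT, streaming=True)
    yield from dataset


def first_n(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most n records from any iterable of records."""
    return list(islice(records, n))


# Example (requires network access and `pip install datasets`):
#   for record in first_n(stream_common_corpus(), 3):
#       print(sorted(record.keys()))  # inspect the columns instead of assuming them
```

Streaming is a sensible default for a corpus of this size, since it avoids downloading every shard before the first record can be read.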

Why This Matters

Many have argued that training large language models requires copyrighted data, which poses significant legal and ethical challenges for open-source development. Pleias is challenging this notion by providing a massive, permissively licensed dataset that the AI community can use freely. The release responds to regulatory pressure and is intended to show that models trained on high-quality open data can perform comparably to those trained on copyrighted material.

Key Features of Common Corpus

Truly Open: The dataset contains only permissively licensed data, and every source is documented with provenance information.
Multilingual: English and French make up the majority of the content, and more than 30 languages have at least 1 billion tokens each.
Diverse: It encompasses a wide range of sources, including scientific articles, government and legal documents, code, and cultural heritage data like books and newspapers.
Extensively Curated: The dataset has undergone rigorous curation to ensure high quality. This includes:
- Correcting spelling and formatting issues in digitized texts
- Removing harmful and toxic content
- Filtering out low-educational-value content
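
The licensing and per-language claims above can be checked mechanically once each record carries metadata. The sketch below is illustrative only: the field names `license`, `language`, and `token_count`, the allowlist, and the sample records are all assumptions rather than the dataset's documented schema.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Set

# Hypothetical allowlist; the article does not enumerate the accepted licenses.
ALLOWED_LICENSES: Set[str] = {"public domain", "cc0", "cc-by", "mit", "apache-2.0"}


def is_permissive(record: Dict[str, Any], allowed: Set[str] = ALLOWED_LICENSES) -> bool:
    """True when the record's license string is on the allowlist."""
    return str(record.get("license", "")).strip().lower() in allowed


def tokens_per_language(records: Iterable[Dict[str, Any]]) -> Counter:
    """Sum token counts per language over permissively licensed records only."""
    totals: Counter = Counter()
    for record in records:
        if is_permissive(record):
            totals[str(record["language"])] += int(record["token_count"])
    return totals


def languages_at_least(totals: Counter, threshold: int) -> List[str]:
    """Languages whose token total meets the threshold, sorted by name."""
    return sorted(lang for lang, count in totals.items() if count >= threshold)


# Made-up records for illustration; real counts are orders of magnitude larger.
SAMPLE_RECORDS = [
    {"language": "English", "license": "CC-BY", "token_count": 1200},
    {"language": "French", "license": "Public Domain", "token_count": 900},
    {"language": "French", "license": "cc0", "token_count": 300},
    {"language": "German", "license": "all rights reserved", "token_count": 5000},
]

totals = tokens_per_language(SAMPLE_RECORDS)
well_covered = languages_at_least(totals, threshold=1000)
```

On the full corpus the same tally, with a threshold of 1 billion, would reproduce the "more than 30 languages" figure if the metadata matches these assumptions.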

Technical Details and Implementation

The creation of Common Corpus involved several key steps:

Data Collection: Pleias sourced data from a variety of open repositories, ensuring that all content was permissively licensed.
Preprocessing: Extensive preprocessing was applied to clean the data. This included:
- Spell Checking: Using advanced NLP techniques to correct common spelling errors in digitized texts.
- Toxicity Filtering: Employing models trained on toxicity detection datasets to remove harmful content.
- Quality Control: Implementing heuristics and machine learning models to filter out low-quality or irrelevant data.
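
The preprocessing steps above can be sketched as a small filter chain. This is a simplified stand-in, not Pleias's pipeline: OCR-style cleanup (rejoining hyphenated line breaks and collapsing whitespace) replaces real spelling correction, a keyword blocklist replaces a trained toxicity model, and two simple heuristics replace the quality classifiers. Every function name and threshold here is hypothetical.

```python
import re
from typing import Iterable, Iterator, Set

# Placeholder blocklist standing in for a trained toxicity classifier.
BLOCKLIST: Set[str] = {"badword", "slur"}


def rejoin_hyphenated(text: str) -> str:
    """Rejoin words that digitization split across a line break with a hyphen."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace, a common artifact in digitized texts."""
    return re.sub(r"\s+", " ", text).strip()


def looks_toxic(text: str, blocklist: Set[str] = BLOCKLIST) -> bool:
    """Flag text that contains any blocklisted word (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    return not words.isdisjoint(blocklist)


def passes_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Keep text that is long enough and mostly letters and spaces."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def clean_corpus(texts: Iterable[str]) -> Iterator[str]:
    """Clean each document, then drop toxic or low-quality ones."""
    for raw in texts:
        text = normalize_whitespace(rejoin_hyphenated(raw))
        if text and not looks_toxic(text) and passes_quality(text):
            yield text


# Made-up documents for illustration.
SAMPLE_DOCS = [
    "The archive was digi-\ntized   from printed newspapers.",
    "1234 5678 #### $$$$ 0000 9999",
    "This sentence contains a badword and should be dropped.",
    "Too short.",
]

kept = list(clean_corpus(SAMPLE_DOCS))
```

Running cleanup before filtering matters: a document that looks noisy in raw form may pass the quality check once its formatting artifacts are repaired.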

Tokenization: The dataset was tokenized using a custom tokenizer designed to handle the multilingual nature of the content. The aim is to keep tokenization consistent across languages, which is crucial for training effective multilingual LLMs.
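
To see why tokenizer design matters for multilingual text, consider the most naive scheme, one token per UTF-8 byte. The sketch below is not the tokenizer Pleias used. It only measures how many bytes each character costs in different scripts, with made-up sample phrases.

```python
from typing import Dict


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of text."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative phrases, each meaning roughly "open data".
SAMPLES: Dict[str, str] = {
    "English": "open data",
    "French": "données ouvertes",
    "Greek": "ανοιχτά δεδομένα",
    "Japanese": "オープンデータ",
}


def compare_scripts(samples: Dict[str, str]) -> Dict[str, float]:
    """Bytes per character for each sample, rounded for display."""
    return {lang: round(bytes_per_char(text), 2) for lang, text in samples.items()}


ratios = compare_scripts(SAMPLES)
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio} bytes per character")
```

Under byte-level splitting, Greek and Japanese text costs two to three times as many units per character as English, so the same content consumes more of a model's context. A tokenizer trained on multilingual data can narrow that gap.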

Benchmarks and Performance

While the primary focus of Common Corpus is on openness and compliance, it also aims to deliver high performance. Initial benchmarks show that models trained on this dataset can achieve competitive results in various NLP tasks, including:

Language Modeling: Models trained on Common Corpus perform well on standard language modeling benchmarks.
Translation: The multilingual nature of the dataset makes it particularly useful for training translation models.
Text Generation: The diverse content helps improve the coherence and quality of generated text.

Community Impact

The release of Common Corpus is a significant step towards democratizing AI development. It provides researchers, developers, and enthusiasts with a powerful tool to train and experiment with LLMs without the legal and ethical hurdles associated with copyrighted data. This can lead to more innovation and collaboration within the AI community.

Conclusion

Pleias's release of Common Corpus is a notable milestone for open-source AI. By providing a large, high-quality, permissively licensed dataset, Pleias aims to show that powerful LLMs can be developed without relying on copyrighted material. This opens up new opportunities for research and development while addressing regulatory concerns.