Pile-T5: Enhancing T5 with Better Tokenization and Diverse Data

The Engineer

16 Apr 2024 · 3 min read

EleutherAI upgrades T5 with Pile-T5, leveraging a more sophisticated tokenizer and diverse training data to overcome limitations in code tokenization and dataset quality, pushing NLP capabilities forward.

The T5 model, introduced by Raffel et al. in 2019, has become a cornerstone of the NLP community, with its base model downloaded millions of times from Hugging Face. Despite its popularity, the original T5 tokenizer omits important code-related tokens, and pretraining datasets released since then offer higher-quality filtering and more diverse domains. To address these limitations, EleutherAI introduces Pile-T5, a new version of T5 trained on the Pile dataset (Gao et al., 2020) and using the LLaMA tokenizer (Touvron et al., 2023).

Model Description

Pile-T5 is an enhanced version of T5 that replaces the original pretraining dataset with the Pile and switches to the LLaMA tokenizer. These changes aim to improve the model's performance on code-related tasks and other downstream applications. Here are the key technical details:

Training Data: Trained on the Pile, a high-quality, diverse dataset containing 825 GiB of text from various domains.
Tokenizer: Uses the LLaMA tokenizer, which includes more code-related tokens compared to the original T5 tokenizer.
Training Steps: Trained for 2 million steps, or 2 trillion tokens in total, which is twice the amount used for the original T5 model.
Pretraining Method: Utilizes the span corruption method, similar to the original T5.
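To make the span corruption objective concrete, here is a minimal sketch in Python. It is a toy illustration, not EleutherAI's training code: the spans are passed in by hand (real pipelines sample them randomly), the text is split on whitespace rather than with the LLaMA tokenizer, and the `<extra_id_N>` sentinel names follow the original T5 convention, which is an assumption here and may differ in Pile-T5.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    Sentinel names are an assumption borrowed from the original T5 vocabulary.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid span {(start, end)}")
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # mark where text was dropped
        targets.append(sentinel)             # same marker opens the target span
        targets.extend(tokens[start:end])    # the tokens the model must predict
        cursor = end
    inputs.extend(tokens[cursor:])           # trailing uncorrupted text
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder sees the input with gaps marked by sentinels, and the decoder learns to emit only the missing spans, which keeps target sequences short.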

Performance Improvements

The improvements in Pile-T5 are most visible in token-matched settings, where the compared models are finetuned on the same number of tokens. Here’s a breakdown:

Downstream Tasks: Pile-T5 outperforms the widely used T5-v1.1 models on various downstream tasks.
Code Tasks: Notably, Pile-T5 excels in code-related tasks, thanks to the inclusion of more code tokens in the LLaMA tokenizer.
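Why missing tokens matter for code can be shown with a toy example. The sketch below is a deliberately simplified, character-level stand-in for a real subword tokenizer, and both vocabularies are hypothetical; it only illustrates that symbols absent from a vocabulary collapse to an unknown token, so the model never sees them.

```python
def encode(text, vocab, unk="<unk>"):
    """Character-level toy encoder: anything outside `vocab` becomes `unk`."""
    return [ch if ch in vocab else unk for ch in text]


snippet = "if (x) { y(); }"

# Hypothetical vocabularies, not the real T5 or LLaMA ones.
prose_vocab = set("abcdefghijklmnopqrstuvwxyz ().;")  # no curly braces
code_vocab = prose_vocab | set("{}")                  # curly braces added

lossy = encode(snippet, prose_vocab)
lossless = encode(snippet, code_vocab)

print(lossy.count("<unk>"))          # 2: both braces are lost
print("".join(lossless) == snippet)  # True: the snippet round-trips
```

With the smaller vocabulary, the opening and closing braces become indistinguishable, which is the kind of information loss a code-aware tokenizer avoids.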

Implementation Details

To ensure reproducibility and transparency, EleutherAI has released all necessary resources:

Experiment Scripts: Available on GitHub at EleutherAI/improved-t5.
Model Checkpoints: Available as a collection on EleutherAI's Hugging Face page.
- Main Branch: Contains the final model trained for 2 million steps.
- Intermediate Checkpoints: Released every 10,000 steps to facilitate research on model evolution over time.
T5x Versions: Also available as a Hugging Face collection.
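The release schedule above implies some simple arithmetic, sketched below in Python. The step and token totals come from the article; everything else is an assumption for illustration: that checkpoints start at step 10,000 and include the final step, the `step_<N>` revision naming, and whatever repository id is passed to the hypothetical `load_checkpoint` helper. Check the actual model repositories for the real branch names before relying on this.

```python
TOTAL_STEPS = 2_000_000
TOTAL_TOKENS = 2_000_000_000_000
CHECKPOINT_INTERVAL = 10_000


def checkpoint_steps(total=TOTAL_STEPS, interval=CHECKPOINT_INTERVAL):
    """Steps at which a checkpoint would exist, assuming the first is at `interval`."""
    return list(range(interval, total + 1, interval))


def revision_name(step):
    # Hypothetical naming scheme; the real branch names may differ.
    return f"step_{step}"


def load_checkpoint(repo_id, step):
    """Load one intermediate checkpoint (needs `transformers` and network access)."""
    from transformers import AutoModelForSeq2SeqLM  # imported lazily on purpose

    return AutoModelForSeq2SeqLM.from_pretrained(repo_id, revision=revision_name(step))


steps = checkpoint_steps()
print(len(steps))                   # 200 checkpoints under the assumptions above
print(TOTAL_TOKENS // TOTAL_STEPS)  # 1000000 tokens per training step
```

Pinning a `revision` when loading is what makes studies of model evolution reproducible, since the main branch always points at the final model.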

Evaluation

Pile-T5 models were rigorously evaluated on several benchmarks:

SuperGLUE: A suite of challenging NLP tasks.
CodeXGLUE: Benchmarks for code-related tasks.
MMLU (Massive Multitask Language Understanding): Tests the model's knowledge and problem-solving ability across dozens of academic and professional subjects.
BIG-Bench Hard (BBH): Challenges models with complex reasoning tasks.

Comparisons were made against T5-v1.1 and Flan-T5 models, both of which were finetuned over the same number of tokens. Pile-T5 consistently outperformed these models across all benchmarks, particularly in code-related tasks.

Conclusion

Pile-T5 represents a significant step forward in enhancing the T5 model's capabilities by leveraging a more diverse and high-quality pretraining dataset and an improved tokenizer. These changes lead to better performance on downstream tasks, especially those involving code. For researchers and practitioners looking to push the boundaries of NLP models, Pile-T5 offers a compelling alternative to the original T5.