AutoMathText: Enhancing Mathematical Reasoning with Autonomous Data Selection and Continual Pretraining

The Engineer

14 Feb 2024

AutoMathText uses autonomous data selection to strengthen language models' mathematical skills: a base model chooses the training content without human annotation, and the continually pretrained model improves on prior baselines.

In a significant step forward for language models' mathematical proficiency, researchers from the University of Cambridge and Tsinghua University have introduced AutoMathText. This novel approach leverages base language models to autonomously select high-quality mathematical content for continual pretraining. The result is a 7B-parameter Mistral model that achieves substantial improvements on downstream tasks with a notable reduction in token usage.

Key Changes and Why They Matter

Autonomous Data Selection: Unlike traditional methods that rely on human-annotated data or supervised fine-tuning, AutoMathText uses meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content.
Continual Pretraining: The method continually pretrains a 7B-parameter Mistral model on the curated AutoMathText dataset, which is over 200GB in size.
Efficiency Improvements: The approach demonstrates a 2 times increase in pretraining token efficiency compared to baselines, making it more cost-effective and scalable.
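
To make the zero-shot verifier idea concrete, here is a minimal sketch of how a yes/no judgment from a language model could be turned into a numeric quality score. It assumes the verifier is asked one or more yes/no questions about a passage and that the logits of the "YES" and "NO" answer tokens are available; the function names `lm_score` and `combined_score` and the rule of multiplying per-question scores are illustrative assumptions, not details stated in this article.

```python
import math


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: probability mass on YES versus NO (assumed scoring rule)."""
    # Subtract the larger logit first so exp() cannot overflow.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def combined_score(question_logits):
    """Multiply the per-question scores (hypothetical aggregation).

    question_logits: iterable of (logit_yes, logit_no) pairs, one per question.
    """
    score = 1.0
    for logit_yes, logit_no in question_logits:
        score *= lm_score(logit_yes, logit_no)
    return score
```

A passage the verifier is unsure about (equal logits) scores 0.5 per question, and the score approaches 1.0 as the YES logit dominates. Because the score is continuous, passages can be ranked rather than only accepted or rejected.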

Technical Details

Dataset Curation:
- Data Sources: The AutoMathText dataset is compiled from various sources, including academic papers, textbooks, and online resources.
- Selection Process: Meta-prompted language models act as zero-shot verifiers that evaluate the quality of each piece of mathematical content. Their judgments drive the selection, so only content rated as high quality is kept.
- Dataset Size: The curated dataset totals over 200GB of data, making it one of the largest open-source datasets for mathematical texts.
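
Once each document carries a verifier score, the selection step itself is simple. The sketch below assumes scores in the range 0 to 1 and keeps documents at or above a cutoff; the `ScoredDoc` type, the `select_documents` helper, and the default threshold of 0.6 are hypothetical choices for illustration, not values from the AutoMathText release.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    score: float  # verifier score, assumed to lie in [0, 1]


def select_documents(docs, threshold=0.6):
    """Keep documents scoring at or above the threshold, best first.

    The threshold value is an illustrative assumption.
    """
    kept = [d for d in docs if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


corpus = [
    ScoredDoc("Proof that sqrt(2) is irrational ...", 0.93),
    ScoredDoc("Buy cheap calculators online!", 0.04),
    ScoredDoc("Worked example: integration by parts ...", 0.71),
]
selected = select_documents(corpus)
```

Raising the threshold trades dataset size for quality, which is the lever that lets a smaller token budget go further.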

Model Architecture:
- Base Model: A 7B-parameter Mistral language model serves as the base for continual pretraining.
- Meta-Prompting: A meta-prompt frames the base language model as a zero-shot verifier that judges the mathematical quality of each candidate passage, so no human labels or fine-tuned classifier are needed.
- Pretraining Strategy: The model is continually pretrained on the AutoMathText dataset, with a focus on improving its performance on downstream tasks.
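
The article does not describe the data pipeline used for continual pretraining. As a rough illustration of one common preprocessing step for causal language model pretraining, the sketch below concatenates tokenized documents, separated by an end-of-sequence token, and cuts the stream into fixed-length blocks. The function name `pack_sequences`, the block size, and the token ids are all hypothetical.

```python
def pack_sequences(token_docs, block_size, eos_id):
    """Pack tokenized documents into fixed-length training blocks.

    token_docs: iterable of lists of token ids, one list per document.
    Returns a list of blocks, each exactly block_size tokens long.
    A trailing partial block is dropped.
    """
    buffer = []
    blocks = []
    for doc in token_docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks


# Toy token ids; a real run would use the model's tokenizer.
blocks = pack_sequences([[1, 2, 3], [4, 5]], block_size=2, eos_id=0)
```

Packing avoids padding, so every token in a batch contributes to the loss, which matters when the goal is to get the most out of a limited token budget.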

Performance Metrics:
- Downstream Tasks: The model was tested on the MATH dataset, which is a benchmark for evaluating mathematical reasoning in language models.
- Token Efficiency: Compared to previous continual pretraining works, AutoMathText reaches its results with a pretraining token count that is lower by orders of magnitude. Separately, against baselines, it shows a 2 times increase in pretraining token efficiency.
- Performance Gains: The model shows substantial improvements on the MATH dataset, demonstrating enhanced mathematical reasoning capabilities.
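
One way to read a "2 times increase in pretraining token efficiency" is that the model trained on selected data reaches a given accuracy with half the tokens the baseline needs. The sketch below computes that ratio from two accuracy curves. The helper names and every number in the example curves are made up for illustration; they are not results from the paper.

```python
def tokens_to_reach(curve, target):
    """Return the first token count at which accuracy meets the target.

    curve: list of (tokens_seen, accuracy) pairs sorted by tokens_seen.
    Returns None if the target is never reached.
    """
    for tokens, accuracy in curve:
        if accuracy >= target:
            return tokens
    return None


def token_efficiency_gain(baseline_curve, selected_curve, target):
    """Ratio of baseline tokens to selected-data tokens at the same accuracy."""
    baseline = tokens_to_reach(baseline_curve, target)
    selected = tokens_to_reach(selected_curve, target)
    if baseline is None or selected is None:
        return None
    return baseline / selected


# Hypothetical curves: (tokens seen, benchmark accuracy).
baseline_curve = [(1_000_000_000, 0.12), (2_000_000_000, 0.14), (4_000_000_000, 0.16)]
selected_curve = [(1_000_000_000, 0.14), (2_000_000_000, 0.16)]
gain = token_efficiency_gain(baseline_curve, selected_curve, target=0.16)
```

With these invented curves the baseline needs 4 billion tokens and the selected-data run needs 2 billion to hit the same accuracy, so the gain is 2.0.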

Implementation Notes

Code and Data Availability:
- The AutoMathText dataset is available on Hugging Face.
- The code for the autonomous data selection and continual pretraining process is available on GitHub.

Conclusion

AutoMathText represents a significant advancement in enhancing language models' mathematical reasoning capabilities. By leveraging autonomous data selection and continual pretraining, this approach not only improves model performance but also does so more efficiently. The availability of the AutoMathText dataset and code makes it easier for researchers to build upon this work and further advance the field.