Optimal Learning Rate Scaling for LLMs Across Token Horizons

The Engineer

3 Oct 2024 · 3 min read

Researchers examine how the optimal learning rate for large language models changes with the token horizon, the total number of tokens seen during training, and offer guidelines for transferring learning rates from short runs to long ones.

Scaling is a critical factor in the performance of large language models (LLMs). While much attention has been paid to scaling model size, dataset size, and cluster size, the effect of the token horizon on hyperparameters, specifically the learning rate (LR), has received little attention. A recent paper by Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, and Xia Song addresses this gap and offers practical guidance for practitioners.

What Changed Technically

The authors conducted a large-scale empirical study to understand how the optimal learning rate (LR) changes with token horizon in LLM training. Token horizon refers to the total number of tokens seen during training. The key findings are:

Significant LR Change: The optimal LR decreases significantly as the token horizon increases. This means that longer training runs require smaller LRs.
Scaling Law: The optimal LR follows a predictable scaling law, allowing for accurate estimation from shorter horizons to longer ones.
Zero Overhead Transfer: They provide a rule-of-thumb for transferring LR across different token horizons without incurring additional overhead.
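
To make the definition of token horizon concrete, here is a minimal sketch of how it could be computed from a training configuration. It assumes a fixed global batch size and a fixed sequence length with no padding; the helper name `token_horizon` and the example numbers are hypothetical and do not come from the paper.

```python
def token_horizon(num_steps: int, global_batch_size: int, seq_len: int) -> int:
    """Total tokens seen during training.

    Hypothetical helper: assumes every step processes `global_batch_size`
    sequences of exactly `seq_len` tokens (no padding, no batch-size ramp).
    """
    return num_steps * global_batch_size * seq_len


if __name__ == "__main__":
    # Illustrative configuration, not taken from the paper.
    horizon = token_horizon(num_steps=100_000, global_batch_size=512, seq_len=2048)
    print(f"{horizon:,} tokens")
```

With these illustrative settings the horizon comes to roughly 105 billion tokens. Doubling the number of steps doubles the horizon, which is the quantity the optimal LR is reported to depend on.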

Why It Matters

For practitioners, this research is crucial because:

Economic Feasibility: Extensive hyperparameter tuning for large-scale runs is economically prohibitive. The ability to transfer optimal LRs from smaller experiments can save significant computational resources.
Performance Optimization: Using the wrong LR can lead to suboptimal performance or even training instability. By understanding how LR scales with token horizon, practitioners can avoid these pitfalls.

Key Findings and Implementation Details

Optimal LR Change:
- The study demonstrates that as the token horizon increases, the optimal LR decreases. One intuition is that longer runs benefit from smaller, more careful updates that are less likely to overshoot the loss minimum.
- For example, an optimal LR of \(10^{-3}\) for a short horizon might need to be reduced to \(10^{-4}\) for a much longer horizon.
Scaling Law:
- The authors found that the optimal LR follows a scaling law of the form: \[ \text{Optimal LR} = k \cdot (\text{Token Horizon})^{-\alpha} \] where \(k\) and \(\alpha\) are constants determined empirically.
- This relationship allows for accurate prediction of the optimal LR for longer horizons based on data from shorter runs.
Rule-of-Thumb:
- The paper provides a simple rule of thumb for practitioners to transfer LRs across different token horizons: \[ \text{New Optimal LR} = \left(\frac{\text{New Token Horizon}}{\text{Old Token Horizon}}\right)^{-\alpha} \cdot \text{Old Optimal LR} \]
- This formula can be applied with minimal computational overhead, making it practical for large-scale training.
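
The scaling law and the transfer rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names `fit_power_law` and `transfer_lr` are hypothetical, the fit is an ordinary least-squares line in log-log space, and the data points are synthetic values generated from a made-up power law (k = 10, alpha = 0.5) rather than results from the paper.

```python
import math


def fit_power_law(horizons, best_lrs):
    """Fit best_lr = k * horizon ** (-alpha) by least squares in log-log space.

    Hypothetical helper. Taking logs gives a straight line:
        log(best_lr) = log(k) - alpha * log(horizon)
    so alpha is minus the slope and log(k) is the intercept.
    """
    xs = [math.log(h) for h in horizons]
    ys = [math.log(lr) for lr in best_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    alpha = -slope
    k = math.exp(y_mean + alpha * x_mean)
    return k, alpha


def transfer_lr(old_lr, old_horizon, new_horizon, alpha):
    """Rule-of-thumb transfer: scale the LR by the horizon ratio to the power -alpha."""
    return old_lr * (new_horizon / old_horizon) ** (-alpha)


if __name__ == "__main__":
    # Synthetic "best LR per horizon" points from a made-up law (k=10, alpha=0.5).
    # In practice these would come from LR sweeps at several short horizons.
    horizons = [1e9, 4e9, 16e9]
    best_lrs = [10 * h ** -0.5 for h in horizons]

    k, alpha = fit_power_law(horizons, best_lrs)
    print(f"fitted k={k:.3g}, alpha={alpha:.3g}")

    # Extrapolate from the longest measured horizon to a much longer run.
    new_lr = transfer_lr(best_lrs[-1], horizons[-1], 256e9, alpha)
    print(f"suggested LR at 256B tokens: {new_lr:.3g}")
```

Note that the transfer rule needs only the exponent, not \(k\): once \(\alpha\) is estimated, a single tuned LR at a short horizon is enough to suggest an LR for a longer one. The value of \(\alpha\) should be taken from the paper or fitted on your own sweeps; the 0.5 used here is only a placeholder.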

Case Study: LLaMA-1

The authors also examine LLaMA-1, a well-known LLM. They argue that LLaMA-1 used too high an LR, which hurt its performance, and they estimate the size of that performance hit to highlight the importance of proper hyperparameter tuning.

Conclusion

This research underscores the importance of accounting for token horizon when setting the LR for LLM training. The scaling law and rule of thumb give practitioners a way to set the LR for a long run from shorter experiments with minimal overhead. As training runs grow longer, this guidance should become more relevant.