Step Law: A Universal Framework for Hyperparameter Optimization in Large Language Model Pretraining

The Engineer

10 Mar 2025

Researchers introduce "Step Law," an empirically derived framework for setting hyperparameters in large language model pretraining that reduces the need for trial-and-error search.

In a new study, researchers from several institutions introduce the "Step Law," a framework for choosing hyperparameters in large language model (LLM) pretraining that they describe as universal. The empirical investigation involved training more than 3,700 LLMs, consuming roughly 100 trillion tokens in total and nearly one million NVIDIA H800 GPU hours. The result is a principled approach to hyperparameter tuning that substantially narrows the search space.

What Changed Technically

The key technical contribution is the empirical validation of the Step Law, which describes how the optimal hyperparameters, specifically learning rate and batch size, scale with model size \( N \) and dataset size \( D \). The authors report that the law is predictable and generalizes across different LLM architectures.

Learning Rate Scaling: The optimal learning rate follows a power-law relationship with \( N \) and \( D \).
Batch Size Scaling: The optimal batch size is primarily influenced by \( D \) and remains largely invariant to \( N \).
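
A power law becomes linear after taking logarithms, so its constants can be estimated with ordinary least squares on log-transformed data. The sketch below illustrates that general idea on synthetic data. It is not the authors' fitting procedure, which this article does not describe, and the names `fit_power_law` and `solve_linear_system` and all numeric values are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_power_law(samples):
    """Fit eta = k * N**a * D**b by least squares in log space.

    samples: iterable of (N, D, eta) triples. Returns (k, a, b).
    """
    rows = [(1.0, math.log(n), math.log(d)) for n, d, _ in samples]
    targets = [math.log(eta) for _, _, eta in samples]
    # Normal equations: (X^T X) w = X^T y, with w = (log k, a, b).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    log_k, a, b = solve_linear_system(xtx, xty)
    return math.exp(log_k), a, b

# Synthetic, noise-free data generated from made-up constants
# (placeholders for illustration, not values from the study).
true_k, true_a, true_b = 2.0, -0.7, 0.3
samples = [
    (n, d, true_k * n**true_a * d**true_b)
    for n in (1e8, 3e8, 1e9)
    for d in (1e9, 1e10, 1e11)
]
k, a, b = fit_power_law(samples)
print(f"k={k:.3f}, a={a:.3f}, b={b:.3f}")
```

With real measurements the optimal learning rate at each scale is itself noisy, so the recovered exponents would carry uncertainty that this noise-free toy does not show.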

Why It Matters

For practitioners, the Step Law offers several significant benefits:

Reduced Search Complexity: Because the hyperparameter landscape is convex, practitioners can find near-optimal settings efficiently without an exhaustive search.
Generalizability: The law applies across different model architectures and data recipes, making it a versatile tool for LLM pretraining.
Performance Consistency: The estimated optima from the Step Law deviate by only 0.094% from the global best performance found via exhaustive search.
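
To make the size of that figure concrete, here is a small sketch that assumes the 0.094% refers to a relative gap in final loss. The helper `relative_gap_percent` and both loss values are hypothetical, chosen only to land near the reported number.

```python
def relative_gap_percent(loss_predicted, loss_best):
    """Percentage by which the predicted-setting loss exceeds the best loss."""
    return 100.0 * (loss_predicted - loss_best) / loss_best

# Hypothetical losses, not measurements from the study.
loss_best = 2.5000        # best loss found by exhaustive search
loss_predicted = 2.50235  # loss at the hyperparameters the law predicts

gap = relative_gap_percent(loss_predicted, loss_best)
print(f"relative gap: {gap:.3f}%")
```

At a loss of about 2.5, a gap of this size corresponds to a difference in the third decimal place.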

Key Findings

The researchers conducted an extensive empirical study to establish the Step Law:

Model and Data Scale: They trained models across a wide range of parameter counts and dataset sizes. The 100 trillion tokens cited above is the total consumed across all training runs, not the size of any single dataset.
Hyperparameter Landscape: Under fixed \( N \) and \( D \), the loss as a function of learning rate and batch size was found to be convex with a broad optimum. In other words, a wide range of hyperparameters yields near-optimal performance, which reduces the need for fine-grained tuning.
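
What a convex landscape with a broad optimum means in practice can be shown with a toy example. The sketch below uses a synthetic bowl-shaped loss in log learning rate and log batch size and counts how many grid points fall within 0.5% of the minimum. The function `synthetic_loss`, its constants, and the tolerance are all assumptions for illustration and say nothing about the real loss surfaces in the study.

```python
import math

def synthetic_loss(lr, batch_size, lr_opt=1e-3, bs_opt=512,
                   curvature=0.01, floor=2.5):
    """Toy convex bowl in (log lr, log batch size); not a real training loss."""
    dx = math.log(lr / lr_opt)
    dy = math.log(batch_size / bs_opt)
    return floor + curvature * (dx * dx + dy * dy)

# Log-spaced grid around the toy optimum.
learning_rates = [1e-3 * 2 ** (i / 2) for i in range(-6, 7)]
batch_sizes = [int(512 * 2 ** j) for j in range(-3, 4)]
grid = [(lr, bs) for lr in learning_rates for bs in batch_sizes]

losses = {cfg: synthetic_loss(*cfg) for cfg in grid}
best_config = min(losses, key=losses.get)
best_loss = losses[best_config]

# Configurations whose loss is within 0.5% of the best one.
tolerance = 0.005
near_optimal = [cfg for cfg, loss in losses.items()
                if loss <= best_loss * (1 + tolerance)]

print(f"best: {best_config}, near-optimal: {len(near_optimal)} of {len(grid)}")
```

A flatter bowl (smaller `curvature`) enlarges the near-optimal region, which is the sense in which a broad optimum makes coarse search sufficient.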

Implementation Details

Power-Law Relationship:
- The optimal learning rate \( \eta \) can be expressed as \[ \eta = k_1 \cdot N^{a} \cdot D^{b} \] where \( k_1 \), \( a \), and \( b \) are constants determined empirically.
- The optimal batch size \( B \) can be expressed as \[ B = k_2 \cdot D^{c} \] where \( k_2 \) and \( c \) are constants determined empirically.
Empirical Validation: The researchers validated the Step Law by comparing models trained with the hyperparameters it predicts against models trained with the best hyperparameters found via exhaustive search. On the test set, the Step Law predictions came within 0.094% of the best performance.
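
The two relationships above translate directly into code. In the minimal sketch below, the function names are hypothetical and the constants are round placeholders, not the fitted values from the study. The negative exponent on model size is itself an assumption, reflecting the common observation that larger models use smaller learning rates. Substitute the published constants before using anything like this in practice.

```python
def step_law_learning_rate(n_params, n_tokens, k1, a, b):
    """Learning rate of the form k1 * N**a * D**b."""
    return k1 * n_params**a * n_tokens**b

def step_law_batch_size(n_tokens, k2, c):
    """Batch size of the form k2 * D**c; units depend on how k2 was fitted."""
    return k2 * n_tokens**c

# Placeholder constants for illustration only (NOT the study's fitted values).
K1, A, B_EXP = 2.0, -0.7, 0.3
K2, C = 0.5, 0.5

lr_small = step_law_learning_rate(1e9, 1e11, K1, A, B_EXP)
lr_large_model = step_law_learning_rate(4e9, 1e11, K1, A, B_EXP)
bs = step_law_batch_size(1e11, K2, C)

print(f"lr (1B params):  {lr_small:.3e}")
print(f"lr (4B params):  {lr_large_model:.3e}")
print(f"batch size:      {bs:.0f}")
```

Note that `step_law_batch_size` takes no model-size argument, which mirrors the finding that the optimal batch size depends mainly on dataset size.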

Practical Implications

For practitioners working with LLMs, the Step Law provides a robust and efficient method for hyperparameter tuning:

Initial Hyperparameter Selection: Use the power-law relationships to select initial learning rates and batch sizes.
Iterative Refinement: While the law provides a strong starting point, iterative refinement may still be necessary for specific use cases.
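
One way to combine the two steps is to treat the predicted values as the center of a small multiplicative grid and evaluate only those few candidates. The sketch below assumes this workflow. The helpers `local_candidates` and `refine` are hypothetical, and `toy_evaluate` is a stand-in for a real (and expensive) training run.

```python
import math

def local_candidates(lr_pred, bs_pred, factors=(0.5, 1.0, 2.0)):
    """Small multiplicative grid around the predicted hyperparameters."""
    return [(lr_pred * f_lr, max(1, round(bs_pred * f_bs)))
            for f_lr in factors for f_bs in factors]

def refine(lr_pred, bs_pred, evaluate):
    """Return the candidate with the lowest loss according to evaluate."""
    return min(local_candidates(lr_pred, bs_pred),
               key=lambda cfg: evaluate(*cfg))

def toy_evaluate(lr, batch_size):
    """Stand-in for a training run: a convex bowl with a nearby optimum."""
    return 2.5 + 0.01 * (math.log(lr / 1.2e-3) ** 2
                         + math.log(batch_size / 600) ** 2)

# Start from hypothetical predicted values and refine locally.
best_lr, best_bs = refine(1e-3, 512, toy_evaluate)
print(f"refined choice: lr={best_lr}, batch_size={best_bs}")
```

In this toy case the predicted point already wins, which is the outcome a broad optimum makes likely: the refinement confirms the starting point rather than moving far from it.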

Conclusion

The Step Law gives LLM pretraining a general, empirically validated rule for choosing learning rate and batch size. By shrinking the hyperparameter search and holding up across architectures and data recipes, it lets researchers train large language models more efficiently.