
Researchers unveil "Step Law," an empirically derived framework for setting hyperparameters in large language model pretraining that reduces the need for extensive trial and error.
In a new study, researchers from several institutions have introduced the "Step Law," which they present as a universal framework for optimizing hyperparameters in large language model (LLM) pretraining. The empirical investigation involved training over 3,700 LLMs, processing 100 trillion tokens in total and consuming nearly one million NVIDIA H800 GPU hours. The result is a principled approach to hyperparameter tuning that significantly narrows the search.
The key technical contribution is the empirical validation of the Step Law, which describes how optimal hyperparameters (specifically learning rate and batch size) scale with model size (N) and dataset size (D). This law provides a predictable and generalizable framework for hyperparameter optimization across different LLM architectures.
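As a rough illustration of what such a scaling rule looks like, a generic power law in N and D can be written as follows. The symbols for the coefficients and exponents are placeholders introduced here, not notation or values taken from the study; an exponent of zero simply means the hyperparameter does not depend on that quantity.

```latex
\eta_{\mathrm{opt}}(N, D) = c_{\eta}\, N^{\alpha_{\eta}} D^{\beta_{\eta}},
\qquad
B_{\mathrm{opt}}(N, D) = c_{B}\, N^{\alpha_{B}} D^{\beta_{B}}

% Taking logarithms makes each relationship linear, which is why
% such laws are usually fitted as straight lines in log-log space:
\log \eta_{\mathrm{opt}} = \log c_{\eta} + \alpha_{\eta} \log N + \beta_{\eta} \log D
```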
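For practitioners, the main benefit is fewer sweeps: the optimal learning rate and batch size can be estimated from model size and dataset size before training starts, rather than found by trial and error.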
The researchers conducted an extensive empirical study to establish the Step Law:

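Power-Law Relationship: The law expresses the optimal learning rate and batch size as power-law functions of model size (N) and dataset size (D).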
Empirical Validation: The researchers validated the Step Law by comparing the performance of models trained with hyperparameters predicted by the law to those found via exhaustive search. The results showed that the Step Law predictions were within 0.094% of the best performance on the test set.
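The two points above can be sketched in a few lines of Python: evaluating a power law to get a learning rate and batch size, then measuring how far the resulting loss is from the best loss found by search. This is a minimal sketch under stated assumptions, not the researchers' code. The function names, the coefficients, the model and dataset sizes, and the loss values are all hypothetical placeholders chosen for illustration; none of them come from the study.

```python
def power_law(coef: float, n_exp: float, d_exp: float,
              n_params: float, n_tokens: float) -> float:
    """Evaluate coef * N**n_exp * D**d_exp."""
    return coef * (n_params ** n_exp) * (n_tokens ** d_exp)


def relative_gap_percent(loss_predicted: float, loss_best: float) -> float:
    """Percent by which the loss at the predicted setting exceeds the best loss."""
    return 100.0 * (loss_predicted - loss_best) / loss_best


# Placeholder coefficients and exponents (NOT the published Step Law values).
LR_COEF, LR_N_EXP, LR_D_EXP = 1.0, -0.5, 0.25
BS_COEF, BS_N_EXP, BS_D_EXP = 1.0, 0.0, 0.5

N = 1e9   # model parameters (hypothetical)
D = 1e10  # training tokens (hypothetical)

lr = power_law(LR_COEF, LR_N_EXP, LR_D_EXP, N, D)
bs = power_law(BS_COEF, BS_N_EXP, BS_D_EXP, N, D)
print(f"predicted learning rate: {lr:.3e}")
print(f"predicted batch size:    {bs:.3e}")

# Hypothetical losses: one run at the predicted setting, one at the best
# setting found by an exhaustive sweep.
gap = relative_gap_percent(loss_predicted=2.002, loss_best=2.000)
print(f"gap to best: {gap:.3f}%")
```

In practice the coefficients would be replaced with the fitted values from the paper, and the two losses would come from real training runs rather than constants.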
For practitioners working with LLMs, the Step Law offers an efficient alternative to exhaustive search: compute the predicted learning rate and batch size from N and D, and use them to configure the training run.
The Step Law represents a significant advancement in the field of LLM pretraining by providing a universal framework for hyperparameter optimization. By reducing the complexity of the search process and offering generalizable insights, it empowers researchers to more efficiently train and deploy large language models across various tasks and datasets.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
10 March 2025