TLDR: Researchers discovered that the optimal training of large language models (LLMs) across different model and dataset sizes is guided by a single invariant: the operator norm of the output layer. This “norm transfer” provides a necessary condition for optimal performance. They also found specific scaling rules for learning rate and batch size that act as sufficient conditions, consistent with other optimizers like Adam, and identified optimal per-layer learning rates for further performance gains.
Training Large Language Models (LLMs) efficiently is a monumental challenge, often requiring extensive hyperparameter tuning as models and datasets grow. Despite significant progress, a unified principle explaining optimal hyperparameter transfer across different scales has remained elusive. A recent research paper, “Optimal Scaling Needs Optimal Norm,” by Oleg Filatov, Jiangtao Wang, Jan Ebert, and Stefan Kesselheim from the Jülich Supercomputing Centre, sheds new light on this problem, proposing a concept they term “norm transfer.”
The core discovery of this work is that joint optimal scaling across both model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across the scales tested (models up to 1.3 billion parameters, datasets up to 138 billion tokens), the optimal combination of learning rate and batch size consistently yields the same operator norm for the output layer. This phenomenon, “norm transfer,” acts as a necessary condition for achieving optimal performance.
The Scion Optimizer and Norm-Based Optimization
The researchers utilized the Scion optimizer, a tool that reframes optimization as a process of controlling the operator norms of a model’s weight matrices and gradient updates. This norm-based perspective allows for a deeper understanding of training dynamics beyond just monitoring the loss curve. By tracking and analyzing layer norms across thousands of experiments, they were able to pinpoint this unifying invariant.
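To make the idea of tracking an output-layer operator norm concrete, here is a minimal sketch in plain Python. This article does not state which induced norm is tracked, so the choice of the RMS-to-infinity norm is an assumption made for illustration; the function name and the toy matrix are hypothetical. For a weight matrix with `d_in` input features, this particular norm equals `sqrt(d_in)` times the largest row-wise Euclidean norm.

```python
import math


def rms_to_inf_operator_norm(weight):
    """Operator norm induced by the RMS norm on inputs and the max norm on outputs.

    `weight` is a list of rows with shape (d_out, d_in).
    ASSUMPTION: RMS -> infinity is used here only as an illustrative choice of
    operator norm; the article does not specify the exact norm.
    """
    d_in = len(weight[0])
    # Each output coordinate is maximized by aligning the input with that row,
    # so the induced norm is sqrt(d_in) times the largest row L2 norm.
    max_row_l2 = max(math.sqrt(sum(w * w for w in row)) for row in weight)
    return math.sqrt(d_in) * max_row_l2


# Toy 2 x 4 "output layer" (hypothetical values).
W = [
    [0.3, -0.4, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
]
output_norm = rms_to_inf_operator_norm(W)
print(f"output-layer operator norm: {output_norm:.4f}")
```

Logging a scalar like this over the course of training, for each learning rate and batch size pair, is the kind of measurement that would let different runs be compared on something other than the loss curve.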
While the constant output layer norm is a crucial indicator, it is not the sole determinant of optimality. Many different learning rate and batch size pairs can reach this optimal norm, but only one pair achieves the best possible loss. To address this, the paper provides the first empirical measurements of how the optimal learning rate and batch size scale with dataset size for the Scion optimizer. The optimal learning rate (η*) scales with batch size (B) and dataset size (D) approximately as η* ∝ B^0.62 · D^−0.56. The optimal batch size itself scales with dataset size as B* ∝ D^(0.45 ± 0.07), which gives an optimal learning rate scaling with dataset size of η* ∝ D^(−0.28 ± 0.07). These scaling rules are consistent with those observed for the widely used Adam optimizer.
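The exponents quoted above can be checked against each other and turned into a small helper. The sketch below uses only the point estimates from the text (0.62, −0.56, 0.45) and ignores the reported uncertainty; the function name and the reference configuration are hypothetical, chosen only to show the arithmetic.

```python
# Point estimates quoted in the text (uncertainties are ignored in this sketch).
LR_BATCH_EXP = 0.62   # eta* ~ B^0.62 at fixed dataset size
LR_DATA_EXP = -0.56   # eta* ~ D^-0.56 at fixed batch size
BATCH_DATA_EXP = 0.45  # B* ~ D^0.45

# Substituting B* ~ D^0.45 into eta* ~ B^0.62 * D^-0.56 gives the combined exponent.
COMBINED_LR_EXP = LR_BATCH_EXP * BATCH_DATA_EXP + LR_DATA_EXP  # about -0.28


def scale_hyperparameters(ref_lr, ref_batch, ref_tokens, new_tokens):
    """Extrapolate a tuned (learning rate, batch size) pair to a new dataset size.

    All reference values are hypothetical inputs supplied by the caller.
    """
    ratio = new_tokens / ref_tokens
    new_batch = ref_batch * ratio ** BATCH_DATA_EXP
    new_lr = ref_lr * (new_batch / ref_batch) ** LR_BATCH_EXP * ratio ** LR_DATA_EXP
    return new_lr, new_batch


# Hypothetical reference run: lr 0.01 and batch size 256 tuned at 1B tokens,
# extrapolated to a 16x larger dataset.
new_lr, new_batch = scale_hyperparameters(0.01, 256, 1e9, 16e9)
print(f"combined exponent: {COMBINED_LR_EXP:.3f}")
print(f"new lr: {new_lr:.5f}, new batch size: {new_batch:.0f}")
```

In practice the batch size would be rounded to a value the hardware supports, and the ±0.07 uncertainty on the exponents means extrapolations far beyond the measured range should be treated as a starting point for tuning rather than a final answer.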
Per-Layer Learning Rates and Practical Insights
Beyond global hyperparameter tuning, the study also explored the impact of setting different learning rates for different groups of layers within the model. They discovered that tuning per-layer-group learning rates can improve model performance by up to 6% in relative loss. Specifically, a learning rate ratio of 1:1/8:1 for the input, hidden, and output layers, respectively, was found to be consistently optimal across various dataset and batch sizes. The output layer was identified as the most sensitive to tuning, with sensitivity progressively decreasing for hidden and then input layers.
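The 1:1/8:1 ratio can be expressed as a simple mapping from layer group to learning rate multiplier. The following sketch is framework-agnostic; the parameter names (`embed`, `blocks`, `lm_head`) and the grouping rule are assumptions made for illustration and do not come from the paper's code.

```python
# Learning rate multipliers for input : hidden : output, as quoted in the text.
LR_RATIOS = {"input": 1.0, "hidden": 1.0 / 8.0, "output": 1.0}


def layer_group(param_name):
    """Assign a parameter to a layer group based on a hypothetical naming scheme."""
    if param_name.startswith("embed"):
        return "input"
    if param_name.startswith("lm_head"):
        return "output"
    return "hidden"


def per_group_learning_rates(base_lr, param_names):
    """Build one config entry per layer group with its scaled learning rate."""
    groups = {"input": [], "hidden": [], "output": []}
    for name in param_names:
        groups[layer_group(name)].append(name)
    return [
        {"group": group, "params": names, "lr": base_lr * LR_RATIOS[group]}
        for group, names in groups.items()
    ]


# Hypothetical parameter names for a small transformer.
example_names = [
    "embed.weight",
    "blocks.0.attn.weight",
    "blocks.0.mlp.weight",
    "lm_head.weight",
]
param_groups = per_group_learning_rates(0.01, example_names)
for entry in param_groups:
    print(entry["group"], entry["lr"], entry["params"])
```

A list of this shape resembles the per-parameter-group options that many optimizer libraries accept, so the same grouping could be adapted to whichever training framework is in use.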
The “norm-everywhere” approach, where the input to every linear layer is normalized, played a significant role in these findings. This consistent treatment of norms throughout the model, combined with the Scion optimizer’s design, appears to contribute to the observed norm transfer phenomenon, even across model depth scaling, which is often challenging to achieve.
Implications and Future Directions
The findings presented in this paper offer practical insights for guiding optimal scaling in LLM training. The concept of “norm transfer” provides a necessary condition for hyperparameter selection, while the derived scaling rules offer a sufficient condition for achieving the best loss. The researchers have also open-sourced their Distributed Scion (Disco) implementation and extensive training logs to foster further research into LLM training dynamics at scale. You can find more details in the full research paper: “Optimal Scaling Needs Optimal Norm.”
This work opens up several intriguing questions for future exploration, such as why the optimal norm transfers, the underlying reasons for the observed scaling rules, and whether this phenomenon is specific to the output layer or the Scion optimizer. Nevertheless, this study represents a significant step towards a more unified and principled understanding of optimal scaling in the era of ever-growing LLMs.


