TLDR: A study compared the AdamW, Lion, and Sophia optimizers for pre-training 2.7-billion-parameter LLMs on a limited budget. Lion trained fastest, Sophia reached the lowest training loss, but AdamW consistently delivered the best performance on downstream tasks, making it the most balanced choice for practical applications. The research also validated hyperparameter transfer with Maximal Update Parametrization for Lion and Sophia.
In the rapidly evolving field of Large Language Models (LLMs), the efficiency and effectiveness of pre-training are paramount. Training these powerful models demands significant computational resources, making the choice of optimizer a critical factor in reducing training times and achieving high-performing models. A recent research paper delves into this crucial aspect, offering a comprehensive comparison of three prominent optimizers: AdamW, Lion, and Sophia.
The study, titled “Pre-Training LLMs on a budget: A comparison of three optimizers,” was conducted by Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, and Fabian Küch from the Fraunhofer Institute for Integrated Circuits IIS. Their work provides valuable insights for practitioners and researchers operating under computational constraints.
The Optimizers Under the Microscope
The researchers focused on three distinct optimizers:
- AdamW: The de-facto standard in deep learning, AdamW modifies the popular Adam algorithm by decoupling weight decay regularization from the gradient-based update.
- Lion: A more recent and simpler optimizer, Lion stands out due to its unusual origin: it was discovered through an evolutionary program search rather than designed by hand.
- Sophia: This optimizer uses second-order information, considering not just the slope of the loss function but also its curvature via a lightweight estimate of the diagonal of the Hessian, while keeping the per-step cost close to that of first-order methods. A simplified sketch of all three update rules follows.
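To make the differences concrete, here is a minimal NumPy sketch of a single parameter update for each optimizer. The hyperparameter names and default values are the conventional ones from the respective papers, not the settings used in the study, and Sophia's diagonal Hessian estimate `h` is assumed to be maintained (and only periodically refreshed) outside the function shown here.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    """One AdamW step: bias-corrected first/second moments plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """One Lion step: the update direction is just the sign of an interpolated momentum."""
    p = p - lr * (np.sign(beta1 * m + (1 - beta1) * g) + wd * p)
    m = beta2 * m + (1 - beta2) * g  # momentum is refreshed after the parameter update
    return p, m

def sophia_step(p, g, m, h, lr=1e-4, beta1=0.96, gamma=0.01, eps=1e-12, wd=0.1):
    """One Sophia step: momentum preconditioned by a diagonal Hessian estimate h,
    clipped element-wise so that curvature misestimates cannot blow up the update."""
    m = beta1 * m + (1 - beta1) * g
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    p = p - lr * (update + wd * p)
    return p, m
```

Lion keeps only a single momentum buffer and reduces the update to a sign operation, which is part of why it is cheap per step; Sophia adds the curvature buffer `h`, which the Sophia paper estimates only every few steps to keep the overhead small.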
Methodology for a Fair Comparison
To ensure a robust and generalizable comparison, the study adopted a meticulous methodology. The researchers trained LLMs with approximately 2.7 billion parameters, a size that still requires substantial GPU hours, on a fixed budget of 60 billion tokens. Two base architectures, one from the GPT-2 family and one from the LLaMA family, were used to assess how the optimizers perform across architectural variations.
A key innovation in their approach was the use of Maximal Update Parametrization (µP). This technique allowed them to tune relevant hyperparameters on much smaller proxy models (around 50 million parameters) and then transfer these optimal settings directly to the larger target models. This method is crucial for efficient hyperparameter tuning, especially when dealing with large models and limited compute. The researchers also empirically validated µP for Lion and Sophia, extending its known applicability beyond AdamW.
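As a rough illustration of how such a transfer can be wired up (an assumption for exposition, not the study's code), the sketch below keeps the learning rate tuned on a narrow proxy model for embeddings, biases, and norms, and scales it down by the width ratio for weight matrices whose fan-in grows with width, which is the µP transfer rule for Adam-type optimizers. The readout layer's extra output multiplier and µP's initialization scaling are omitted, and the model, widths, and learning rate in the usage comment are purely illustrative.

```python
import torch

def mup_param_groups(model, base_lr, base_width, width):
    """Build per-parameter learning rates following simplified muP transfer rules:
    weight matrices whose fan-in grows with width get lr * (base_width / width);
    everything else keeps the proxy-tuned learning rate. Classifying layers by
    name is a heuristic, not a general rule."""
    ratio = base_width / width
    scaled, unscaled = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            scaled.append(p)    # hidden and readout matrices: lr shrinks with width
        else:
            unscaled.append(p)  # embeddings, biases, layer norms: lr transfers as-is
    return [
        {"params": scaled, "lr": base_lr * ratio},
        {"params": unscaled, "lr": base_lr},
    ]

# Illustrative usage: base_lr tuned on a ~50M-parameter proxy of width 256,
# then reused for a wide target model of width 2560 (both values hypothetical).
# optimizer = torch.optim.AdamW(
#     mup_param_groups(model, base_lr=3e-3, base_width=256, width=2560),
#     weight_decay=0.1,
# )
```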
The models were trained on a slice of the publicly available SlimPajama dataset. Performance was evaluated using training-related metrics, namely final training and validation losses and wall-clock time, as well as zero-shot downstream scores on standard LLM benchmarks such as ARC-Easy, ARC-Challenge, HellaSwag, and MMLU.
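The paper does not say which evaluation tooling was used; one common way to reproduce this kind of zero-shot benchmark sweep is EleutherAI's lm-evaluation-harness, sketched here with the harness's standard task names for the four benchmarks (the checkpoint path is a placeholder).

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face causal LM backend
    model_args="pretrained=/path/to/checkpoint",  # placeholder checkpoint path
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,                                # zero-shot setup, as in the study
)
for task, metrics in results["results"].items():
    print(task, metrics)
```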
Key Findings: A Balancing Act
The study revealed distinct trade-offs among the three optimizers:
- Lion’s Speed: Lion demonstrated the fastest initial convergence and the shortest training duration, making it an attractive choice for scenarios where rapid iteration or computational efficiency is a top priority.
- Sophia’s Loss Reduction: Sophia achieved the lowest training and validation losses, particularly when models were trained over multiple epochs with the GPT architecture. This suggests Sophia is especially effective at driving the training objective toward lower minima.
- AdamW’s Downstream Performance: Despite Lion’s speed and Sophia’s lower loss, AdamW consistently delivered the highest accuracy across all downstream benchmarks. This reinforces AdamW’s practical value as a standard choice for achieving the best real-world application performance.
The research also noted that training for multiple epochs over the same data could enhance downstream performance, especially for AdamW. Architecturally, GPT-based models generally performed better with Sophia and AdamW, while the LLaMA architecture showed better results with Lion. Interestingly, the LLaMA models required roughly twice the learning rate of the GPT models, highlighting the importance of architecture-specific tuning.
Practical Implications for LLM Pre-Training
The findings provide clear guidance for practitioners. If computational efficiency and rapid experimentation are the primary goals, Lion is a strong contender. For achieving the lowest possible validation loss, particularly in multi-epoch training, Sophia proves to be highly effective. However, for maximizing performance on real-world downstream tasks, AdamW remains the most reliable and balanced choice.
This systematic comparison under constrained computational scenarios offers a valuable foundation for informed optimizer selection in LLM pre-training. For more detail, see the full research paper, “Pre-Training LLMs on a budget: A comparison of three optimizers.”


