TLDR: A study compared the AdamW, Lion, and Sophia optimizers for pre-training 2.7-billion-parameter LLMs on a limited budget. Lion trained fastest, Sophia reached the lowest training loss, but AdamW consistently delivered the best performance on downstream tasks, making it the most balanced choice for practical applications. The research also validated hyperparameter transfer with Maximal Update Parametrization for Lion and Sophia.
In the rapidly evolving field of Large Language Models (LLMs), the efficiency and effectiveness of pre-training are paramount. Training these powerful models demands significant computational resources, making the choice of optimizer a critical factor in reducing training times and achieving high-performing models. A recent research paper delves into this crucial aspect, offering a comprehensive comparison of three prominent optimizers: AdamW, Lion, and Sophia.
The study, titled “Pre-Training LLMs on a budget: A comparison of three optimizers,” was conducted by Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, and Fabian Küch from the Fraunhofer Institute for Integrated Circuits IIS. Their work provides valuable insights for practitioners and researchers operating under computational constraints.
The Optimizers Under the Microscope
The researchers focused on three distinct optimizers:
- AdamW: The de-facto standard in deep learning, AdamW modifies the popular Adam algorithm by decoupling weight decay regularization from the gradient-based update.
- Lion: A more recent and simpler optimizer, Lion stands out due to its unusual origin: it was discovered through an evolutionary program search rather than designed by hand.
- Sophia: This optimizer uses second-order information, considering not just the slope of the loss function but also its curvature via a lightweight estimate of the diagonal of the Hessian, while keeping the per-step cost close to that of first-order methods. A simplified sketch of all three update rules follows.
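To make the differences concrete, here is a minimal NumPy sketch of a single parameter update for each optimizer. The hyperparameter names and default values are the conventional ones from the respective papers, not the settings used in the study, and Sophia's diagonal Hessian estimate `h` is assumed to be maintained (and only periodically refreshed) outside the function shown here.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    """One AdamW step: bias-corrected first/second moments plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """One Lion step: the update direction is just the sign of an interpolated momentum."""
    p = p - lr * (np.sign(beta1 * m + (1 - beta1) * g) + wd * p)
    m = beta2 * m + (1 - beta2) * g  # momentum is refreshed after the parameter update
    return p, m

def sophia_step(p, g, m, h, lr=1e-4, beta1=0.96, gamma=0.01, eps=1e-12, wd=0.1):
    """One Sophia step: momentum preconditioned by a diagonal Hessian estimate h,
    clipped element-wise so that curvature misestimates cannot blow up the update."""
    m = beta1 * m + (1 - beta1) * g
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    p = p - lr * (update + wd * p)
    return p, m
```

Lion keeps only a single momentum buffer and reduces the update to a sign operation, which is part of why it is cheap per step; Sophia adds the curvature buffer `h`, which the Sophia paper estimates only every few steps to keep the overhead small.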
Methodology for a Fair Comparison
To ensure a robust and generalizable comparison, the study adopted a meticulous methodology. The researchers trained LLMs with approximately 2.7 billion parameters, a size that still requires substantial GPU hours, on a fixed budget of 60 billion tokens. Two base architectures, one from the GPT-2 family and one from the LLaMA family, were used to assess how the optimizers perform across architectural variations.
A key innovation in their approach was the use of Maximal Update Parametrization (µP). This technique allowed them to tune relevant hyperparameters on much smaller proxy models (around 50 million parameters) and then transfer these optimal settings directly to the larger target models. This method is crucial for efficient hyperparameter tuning, especially when dealing with large models and limited compute. The researchers also empirically validated µP for Lion and Sophia, extending its known applicability beyond AdamW.
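As a rough illustration of how such a transfer can be wired up (an assumption for exposition, not the study's code), the sketch below keeps the learning rate tuned on a narrow proxy model for embeddings, biases, and norms, and scales it down by the width ratio for weight matrices whose fan-in grows with width, which is the µP transfer rule for Adam-type optimizers. The readout layer's extra output multiplier and µP's initialization scaling are omitted, and the model, widths, and learning rate in the usage comment are purely illustrative.

```python
import torch

def mup_param_groups(model, base_lr, base_width, width):
    """Build per-parameter learning rates following simplified muP transfer rules:
    weight matrices whose fan-in grows with width get lr * (base_width / width);
    everything else keeps the proxy-tuned learning rate. Classifying layers by
    name is a heuristic, not a general rule."""
    ratio = base_width / width
    scaled, unscaled = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            scaled.append(p)    # hidden and readout matrices: lr shrinks with width
        else:
            unscaled.append(p)  # embeddings, biases, layer norms: lr transfers as-is
    return [
        {"params": scaled, "lr": base_lr * ratio},
        {"params": unscaled, "lr": base_lr},
    ]

# Illustrative usage: base_lr tuned on a ~50M-parameter proxy of width 256,
# then reused for a wide target model of width 2560 (both values hypothetical).
# optimizer = torch.optim.AdamW(
#     mup_param_groups(model, base_lr=3e-3, base_width=256, width=2560),
#     weight_decay=0.1,
# )
```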
The models were trained on a slice of the publicly available SlimPajama dataset. Performance was evaluated using training-related metrics, namely final training and validation losses and wall-clock time, as well as zero-shot downstream scores on standard LLM benchmarks such as ARC-Easy, ARC-Challenge, HellaSwag, and MMLU.
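The paper does not say which evaluation tooling was used; one common way to reproduce this kind of zero-shot benchmark sweep is EleutherAI's lm-evaluation-harness, sketched here with the harness's standard task names for the four benchmarks (the checkpoint path is a placeholder).

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face causal LM backend
    model_args="pretrained=/path/to/checkpoint",  # placeholder checkpoint path
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,                                # zero-shot setup, as in the study
)
for task, metrics in results["results"].items():
    print(task, metrics)
```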
Key Findings: A Balancing Act
The study revealed distinct trade-offs among the three optimizers:
- Lion’s Speed: Lion demonstrated the fastest initial convergence and the shortest training duration, making it an attractive choice for scenarios where rapid iteration or computational efficiency is a top priority.
- Sophia’s Loss Reduction: Sophia achieved the lowest training and validation losses, particularly when models were trained over multiple epochs with the GPT architecture. This suggests Sophia is especially effective at driving the training objective toward lower minima.
- AdamW’s Downstream Performance: Despite Lion’s speed and Sophia’s lower loss, AdamW consistently delivered the highest accuracy across all downstream benchmarks. This reinforces AdamW’s practical value as a standard choice for achieving the best real-world application performance.
The research also noted that training for multiple epochs over the same data could enhance downstream performance, especially for AdamW. Architecturally, GPT-based models generally performed better with Sophia and AdamW, while the LLaMA architecture showed better results with Lion. Interestingly, the LLaMA models required roughly twice the learning rate of the GPT models, highlighting the importance of architecture-specific tuning.
Practical Implications for LLM Pre-Training
The findings provide clear guidance for practitioners. If computational efficiency and rapid experimentation are the primary goals, Lion is a strong contender. For achieving the lowest possible validation loss, particularly in multi-epoch training, Sophia proves to be highly effective. However, for maximizing performance on real-world downstream tasks, AdamW remains the most reliable and balanced choice.
This systematic comparison under constrained computational scenarios offers a valuable foundation for informed optimizer selection in LLM pre-training. For more detail, see the full research paper, “Pre-Training LLMs on a budget: A comparison of three optimizers.”


