
Balancing Speed, Loss, and Performance in LLM Training: An Optimizer Comparison

TLDR: A study compared the AdamW, Lion, and Sophia optimizers for pre-training 2.7-billion-parameter LLMs on a limited budget. Lion trained fastest, Sophia achieved the lowest training loss, but AdamW consistently delivered the best performance on downstream tasks, making it the most balanced choice for practical applications. The research also validated hyperparameter transferability for Lion and Sophia using Maximal Update Parametrization.

In the rapidly evolving field of Large Language Models (LLMs), the efficiency and effectiveness of pre-training are paramount. Training these powerful models demands significant computational resources, making the choice of optimizer a critical factor in reducing training times and achieving high-performing models. A recent research paper delves into this crucial aspect, offering a comprehensive comparison of three prominent optimizers: AdamW, Lion, and Sophia.

The study, titled “Pre-Training LLMs on a budget: A comparison of three optimizers,” was conducted by Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, and Fabian Küch from the Fraunhofer Institute for Integrated Circuits IIS. Their work provides valuable insights for practitioners and researchers operating under computational constraints.

The Optimizers Under the Microscope

The researchers focused on three distinct optimizers:

  • AdamW: Considered the de-facto standard in deep learning, this optimizer is a modification of the popular Adam algorithm, incorporating weight decay regularization.
  • Lion: A more recent and simpler optimizer, Lion stands out for its unusual origin: it was discovered through an evolutionary search rather than designed by hand.
  • Sophia: This optimizer employs second-order criteria, meaning it considers not just the slope of the loss function but also its curvature, while maintaining a computationally lean approach.
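Lion's simplicity is easiest to see in its update rule: the step direction is just the sign of an interpolation between the momentum and the current gradient, combined with decoupled weight decay. Below is a minimal NumPy sketch following the published Lion algorithm; the hyperparameter values are illustrative, not the ones used in the study.

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """One Lion update. The step direction is sign(beta1*m + (1-beta1)*g),
    so every coordinate moves by exactly +/- lr (plus weight decay)."""
    update = np.sign(beta1 * m + (1 - beta1) * g)
    w_new = w - lr * (update + wd * w)   # decoupled weight decay, as in AdamW
    m_new = beta2 * m + (1 - beta2) * g  # momentum tracks the raw gradient
    return w_new, m_new

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
for _ in range(1000):
    w, m = lion_step(w, 2 * w, m)
```

Because the update is sign-based, Lion needs only one momentum buffer (versus AdamW's two), which is part of why it is cheap per step.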

Methodology for a Fair Comparison

To ensure a robust and generalizable comparison, the study adopted a meticulous methodology. The team trained LLMs with roughly 2.7 billion parameters, a size that still requires substantial GPU hours, using a fixed budget of 60 billion tokens. Two different base architectures were used, from the GPT-2 family and the LLaMA family, to assess how the optimizers perform across architectural variations.

A key innovation in their approach was the use of Maximal Update Parametrization (µP). This technique allowed them to tune relevant hyperparameters on much smaller proxy models (around 50 million parameters) and then transfer these optimal settings directly to the larger target models. This method is crucial for efficient hyperparameter tuning, especially when dealing with large models and limited compute. The researchers also empirically validated µP for Lion and Sophia, extending its known applicability beyond AdamW.
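The practical upshot of µP, heavily simplified, is that learning rates for hidden, matrix-like weights scale inversely with model width, so a rate tuned on the narrow proxy can be rescaled for the wide target rather than re-tuned from scratch. A rough sketch of that transfer rule follows; the widths and learning rates are hypothetical, not figures from the paper.

```python
def transfer_lr(base_lr: float, proxy_width: int, target_width: int) -> float:
    """µP-style rule of thumb for hidden (matrix) weights:
    the per-layer learning rate scales as 1/width, so the LR tuned
    on the proxy is multiplied by proxy_width / target_width."""
    return base_lr * proxy_width / target_width

# Hypothetical: LR tuned on a 256-wide proxy, transferred to a 2560-wide target
proxy_lr = 1e-2
target_lr = transfer_lr(proxy_lr, proxy_width=256, target_width=2560)
```

In a full µP setup the initialization scales and output multipliers are adjusted alongside the learning rate; this sketch shows only the width-scaling idea that makes small-proxy tuning transferable.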

The models were trained on a slice of the publicly available SlimPajama dataset. Performance was evaluated using both training-related metrics, such as final training and validation losses, and wall clock time, as well as downstream evaluation scores on standard LLM benchmarks like ARC-Easy, ARC-Challenge, Hellaswag, and MMLU in a zero-shot setup.
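In a zero-shot multiple-choice setup like the one described, each answer option is typically scored by the model's (often length-normalized) log-likelihood, and the highest-scoring option is taken as the prediction. A minimal sketch of that scoring rule, using made-up per-token log-probabilities rather than real model outputs:

```python
def zero_shot_choice(option_token_logprobs):
    """Pick the answer option with the highest length-normalized
    sum of per-token log-probabilities."""
    scores = [sum(lps) / len(lps) for lps in option_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical log-probs the model assigned to each option's tokens
options = [[-2.0, -1.5], [-0.5, -0.7, -0.6], [-3.0]]
best = zero_shot_choice(options)  # option with the least-negative average
```

Length normalization keeps longer answer strings from being penalized simply for containing more tokens, a common convention in harnesses for benchmarks such as ARC and Hellaswag.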

Key Findings: A Balancing Act

The study revealed distinct trade-offs among the three optimizers:

  • Lion’s Speed: Lion demonstrated the fastest initial convergence and the shortest training duration, making it an attractive choice for scenarios where rapid iteration or computational efficiency is a top priority.
  • Sophia’s Loss Reduction: Sophia excelled in achieving the lowest training and validation loss, particularly when models were trained over multiple epochs with GPT architectures. This suggests Sophia’s strength in finding better local optima.
  • AdamW’s Downstream Performance: Despite Lion’s speed and Sophia’s lower loss, AdamW consistently delivered the highest accuracy across all downstream benchmarks. This reinforces AdamW’s practical value as a standard choice for achieving the best real-world application performance.

The research also noted that repeating epochs could enhance downstream performance, especially for AdamW. Architecturally, GPT-based models generally performed better with Sophia and AdamW, while LLaMA architectures showed better results with Lion. Interestingly, LLaMA models required approximately twice the learning rate values compared to GPT models, highlighting the importance of architecture-specific tuning.

Practical Implications for LLM Pre-Training

The findings provide clear guidance for practitioners. If computational efficiency and rapid experimentation are the primary goals, Lion is a strong contender. For achieving the lowest possible validation loss, particularly in multi-epoch training, Sophia proves to be highly effective. However, for maximizing performance on real-world downstream tasks, AdamW remains the most reliable and balanced choice.

This systematic comparison under constrained computational scenarios offers a valuable foundation for informed optimizer selection in LLM pre-training. For more detailed information, you can read the full research paper here.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
