TLDR: The REG optimizer is a new method for training large language models (LLMs) that improves upon existing optimizers like AdamW and Muon. It replaces Muon’s unstable matrix sign function with a simpler, more robust Row-and-Column-Scaling (RACS) operator. REG balances gradient updates, leading to superior performance, greater stability, and better compatibility with AdamW-trained models, especially during fine-tuning. It has shown strong results in both LLM mathematical tasks and computer vision, with faster convergence and higher accuracy.
Optimizers are the unsung heroes behind the efficient training of large language models (LLMs), which sit at the forefront of today’s AI advances. While AdamW has long been the go-to standard, recent work has introduced structure-aware optimizers like Muon, which shape gradient updates by operating directly on entire weight matrices. Muon aims to balance gradient updates across all directions, but its reliance on a complex matrix sign function can cause training instability and compatibility issues, especially when fine-tuning models initially trained with AdamW.
Addressing these limitations, researchers have introduced a novel optimizer called REG. This new approach replaces Muon’s aggressive matrix sign operator with a gentler yet effective technique: the Row-and-Column-Scaling (RACS) operator. RACS is theoretically rooted in matrix balancing (equilibration), which lets it regularize update steps in a less drastic manner. This makes REG simpler to implement and more compatible with established training dynamics, particularly those of AdamW.
The core idea behind REG is to tackle the problem of ill-conditioned momentum matrices, which often arise in LLM training. When a momentum matrix is ill-conditioned, it means that a few principal directions dominate the parameter updates, hindering stable convergence. Instead of Muon’s computationally intensive matrix sign function, REG employs the RACS operator. This operator involves straightforward diagonal matrix multiplications, making it computationally efficient. It works by making the rows and columns of the momentum matrix more uniform in magnitude, thereby improving its conditioning.
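As a rough sketch, this kind of row-and-column scaling can be illustrated with a single pass of diagonal rescaling. The exact formulation in the paper (norm choice per step, number of passes, epsilon handling) may differ; this is only an illustrative assumption:

```python
import numpy as np

def racs(M, eps=1e-8):
    """Illustrative RACS-style balancing: divide each row, then each
    column, by its L2 norm. Equivalent to left/right multiplication by
    diagonal matrices, so it is cheap compared to a matrix sign function."""
    # Left diagonal scaling: normalize row magnitudes.
    M = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    # Right diagonal scaling: normalize column magnitudes.
    return M / (np.linalg.norm(M, axis=0, keepdims=True) + eps)

# An ill-conditioned momentum-like matrix: one direction dominates.
M = np.array([[100.0, 0.1],
              [0.1, 0.01]])
B = racs(M)
print(np.linalg.cond(M), "->", np.linalg.cond(B))  # conditioning improves here
```

On this toy matrix the condition number drops by several orders of magnitude, which is the qualitative effect the text describes: no single direction dominates the update after balancing.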
REG integrates this RACS operator into the standard Gradient Descent with Momentum (GDM) framework. It also includes two practical enhancements crucial for large-scale training: weight decay to prevent overfitting and an RMS-based rescaling mechanism to ensure consistent update magnitudes. For the empirically preferred choice of L2-norm scaling, the researchers even derived a closed-form solution for the root mean square (RMS) of the normalized matrix, further boosting computational efficiency.
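Putting the pieces together, one REG-style step might look like the sketch below. The hyperparameter names and values, the single-pass RACS, and the decoupled weight-decay placement are all assumptions for illustration, not the paper’s exact algorithm; the RMS factor is computed numerically here, whereas the paper derives it in closed form for the L2-norm case:

```python
import numpy as np

def reg_step(W, grad, momentum, lr=0.02, beta=0.9, weight_decay=0.01, eps=1e-8):
    """One illustrative REG-style update: heavy-ball momentum, RACS
    balancing of the momentum matrix, RMS rescaling for consistent
    update magnitudes, and decoupled weight decay."""
    momentum = beta * momentum + grad
    # RACS balancing with p = 2: row scaling, then column scaling.
    M = momentum / (np.linalg.norm(momentum, axis=1, keepdims=True) + eps)
    M = M / (np.linalg.norm(M, axis=0, keepdims=True) + eps)
    # RMS rescaling so updates have a consistent magnitude across shapes.
    M = M / (np.sqrt(np.mean(M ** 2)) + eps)
    W = W - lr * (M + weight_decay * W)
    return W, momentum

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
g = rng.standard_normal((4, 8))
W_new, m = reg_step(W, g, np.zeros_like(g))
print(W_new.shape)
```

The RMS rescaling mirrors how AdamW updates tend to have a predictable per-element magnitude, which is one plausible reason REG stays compatible with AdamW-trained checkpoints.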
Extensive experiments on LLM training demonstrate REG’s capabilities. In mathematical reasoning tasks, REG not only achieved superior performance and stability compared to AdamW but also maintained consistency with the AdamW training paradigm. This consistency is particularly vital during the fine-tuning stage, where REG successfully avoids the performance degradation observed with Muon. For instance, on the MATH500 benchmark, REG achieved a remarkable 64.8% accuracy, outperforming all other optimizers. Similarly, in mathematical optimization modeling problems, REG consistently delivered the highest accuracy on most benchmarks, significantly improving performance on challenging datasets like OptMATH-Bench.
Beyond language tasks, REG’s effectiveness was also validated in computer vision. On the CIFAR-100 image classification task, REG achieved superior performance with ResNet-18 and ResNet-50 models compared to baselines like SGD, NGD, and Adam. Notably, REG demonstrated the fastest convergence in terms of both loss reduction and accuracy improvement, indicating its efficiency in quickly finding good parameter configurations early in the training process.
Ablation studies further refined the REG optimizer. It was found that a hybrid approach, where AdamW updates are used specifically for the embedding layers of LLMs while REG handles other parameters, consistently yielded better performance. This REG-with-AdamW configuration is now recommended for optimal results. Additionally, while theoretical results on matrix equilibration often support L1 or L-infinity norms, empirical investigations showed that using the L2-norm (p=2) for the RACS operator provided superior performance and computational efficiency due to its closed-form RMS solution.
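The recommended hybrid configuration amounts to a simple routing rule over parameter names. The name patterns below are hypothetical, and how a tied output head would be treated is not specified in the summary; this is only a sketch of the idea:

```python
def choose_optimizer(param_name: str) -> str:
    """Route embedding parameters to AdamW and the remaining weight
    matrices to REG, following the ablation's recommendation.
    The "embed" substring match is an illustrative assumption."""
    return "adamw" if "embed" in param_name else "reg"

# Hypothetical parameter names from a transformer LLM.
for name in ["model.embed_tokens.weight",
             "layers.0.attn.q_proj.weight",
             "layers.0.mlp.up_proj.weight"]:
    print(name, "->", choose_optimizer(name))
```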
In conclusion, the REG optimizer represents a significant step forward in robust training dynamics for large language models and beyond. By leveraging the computationally efficient and stable RACS operator, REG offers superior performance, enhanced stability, and better compatibility with existing training paradigms compared to previous structure-aware optimizers. This makes it a promising candidate for future supervised fine-tuning applications. For more details, refer to the full research paper.


