TLDR: The REG optimizer is a new method for training large language models (LLMs) that improves upon existing optimizers like AdamW and Muon. It replaces Muon’s unstable matrix sign function with a simpler, more robust Row-and-Column-Scaling (RACS) operator. REG balances gradient updates, leading to superior performance, greater stability, and better compatibility with AdamW-trained models, especially during fine-tuning. It has shown strong results in both LLM mathematical tasks and computer vision, with faster convergence and higher accuracy.
Optimizers are the unsung heroes behind the efficient training of large language models (LLMs), which sit at the forefront of today’s AI advances. While AdamW has long been the go-to standard, recent work has introduced structure-aware optimizers like Muon, which shape gradient updates by operating directly on entire weight matrices. Muon aims to balance gradient updates across all directions, but its reliance on a complex matrix sign function can cause training instability and compatibility issues, especially when fine-tuning models initially trained with AdamW.
Addressing these limitations, researchers have introduced a novel optimizer called REG. This new approach replaces Muon’s aggressive matrix sign operator with a gentler yet effective technique: the Row-and-Column-Scaling (RACS) operator. RACS is theoretically rooted in matrix balancing (equilibration), which lets it regularize update steps in a less drastic manner. This makes REG simpler to implement and more compatible with established training dynamics, particularly those of AdamW.
The core idea behind REG is to tackle the problem of ill-conditioned momentum matrices, which often arise in LLM training. When a momentum matrix is ill-conditioned, it means that a few principal directions dominate the parameter updates, hindering stable convergence. Instead of Muon’s computationally intensive matrix sign function, REG employs the RACS operator. This operator involves straightforward diagonal matrix multiplications, making it computationally efficient. It works by making the rows and columns of the momentum matrix more uniform in magnitude, thereby improving its conditioning.
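As a rough sketch, this kind of row-and-column scaling can be illustrated with a single pass of diagonal rescaling. The exact formulation in the paper (norm choice per step, number of passes, epsilon handling) may differ; this is only an illustrative assumption:

```python
import numpy as np

def racs(M, eps=1e-8):
    """Illustrative RACS-style balancing: divide each row, then each
    column, by its L2 norm. Equivalent to left/right multiplication by
    diagonal matrices, so it is cheap compared to a matrix sign function."""
    # Left diagonal scaling: normalize row magnitudes.
    M = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    # Right diagonal scaling: normalize column magnitudes.
    return M / (np.linalg.norm(M, axis=0, keepdims=True) + eps)

# An ill-conditioned momentum-like matrix: one direction dominates.
M = np.array([[100.0, 0.1],
              [0.1, 0.01]])
B = racs(M)
print(np.linalg.cond(M), "->", np.linalg.cond(B))  # conditioning improves here
```

On this toy matrix the condition number drops by several orders of magnitude, which is the qualitative effect the text describes: no single direction dominates the update after balancing.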
REG integrates this RACS operator into the standard Gradient Descent with Momentum (GDM) framework. It also includes two practical enhancements crucial for large-scale training: weight decay to prevent overfitting and an RMS-based rescaling mechanism to ensure consistent update magnitudes. For the empirically preferred choice of L2-norm scaling, the researchers even derived a closed-form solution for the root mean square (RMS) of the normalized matrix, further boosting computational efficiency.
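Putting the pieces together, one REG-style step might look like the sketch below. The hyperparameter names and values, the single-pass RACS, and the decoupled weight-decay placement are all assumptions for illustration, not the paper’s exact algorithm; the RMS factor is computed numerically here, whereas the paper derives it in closed form for the L2-norm case:

```python
import numpy as np

def reg_step(W, grad, momentum, lr=0.02, beta=0.9, weight_decay=0.01, eps=1e-8):
    """One illustrative REG-style update: heavy-ball momentum, RACS
    balancing of the momentum matrix, RMS rescaling for consistent
    update magnitudes, and decoupled weight decay."""
    momentum = beta * momentum + grad
    # RACS balancing with p = 2: row scaling, then column scaling.
    M = momentum / (np.linalg.norm(momentum, axis=1, keepdims=True) + eps)
    M = M / (np.linalg.norm(M, axis=0, keepdims=True) + eps)
    # RMS rescaling so updates have a consistent magnitude across shapes.
    M = M / (np.sqrt(np.mean(M ** 2)) + eps)
    W = W - lr * (M + weight_decay * W)
    return W, momentum

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
g = rng.standard_normal((4, 8))
W_new, m = reg_step(W, g, np.zeros_like(g))
print(W_new.shape)
```

The RMS rescaling mirrors how AdamW updates tend to have a predictable per-element magnitude, which is one plausible reason REG stays compatible with AdamW-trained checkpoints.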
Extensive experiments on LLM training demonstrate REG’s capabilities. In mathematical reasoning tasks, REG not only achieved superior performance and stability compared to AdamW but also maintained consistency with the AdamW training paradigm. This consistency is particularly vital during the fine-tuning stage, where REG successfully avoids the performance degradation observed with Muon. For instance, on the MATH500 benchmark, REG achieved a remarkable 64.8% accuracy, outperforming all other optimizers. Similarly, in mathematical optimization modeling problems, REG consistently delivered the highest accuracy on most benchmarks, significantly improving performance on challenging datasets like OptMATH-Bench.
Beyond language tasks, REG’s effectiveness was also validated in computer vision. On the CIFAR-100 image classification task, REG achieved superior performance with ResNet-18 and ResNet-50 models compared to baselines like SGD, NGD, and Adam. Notably, REG demonstrated the fastest convergence in terms of both loss reduction and accuracy improvement, indicating its efficiency in quickly finding good parameter configurations early in the training process.
Ablation studies further refined the REG optimizer. It was found that a hybrid approach, where AdamW updates are used specifically for the embedding layers of LLMs while REG handles other parameters, consistently yielded better performance. This REG-with-AdamW configuration is now recommended for optimal results. Additionally, while theoretical results on matrix equilibration often support L1 or L-infinity norms, empirical investigations showed that using the L2-norm (p=2) for the RACS operator provided superior performance and computational efficiency due to its closed-form RMS solution.
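The recommended hybrid configuration amounts to a simple routing rule over parameter names. The name patterns below are hypothetical, and how a tied output head would be treated is not specified in the summary; this is only a sketch of the idea:

```python
def choose_optimizer(param_name: str) -> str:
    """Route embedding parameters to AdamW and the remaining weight
    matrices to REG, following the ablation's recommendation.
    The "embed" substring match is an illustrative assumption."""
    return "adamw" if "embed" in param_name else "reg"

# Hypothetical parameter names from a transformer LLM.
for name in ["model.embed_tokens.weight",
             "layers.0.attn.q_proj.weight",
             "layers.0.mlp.up_proj.weight"]:
    print(name, "->", choose_optimizer(name))
```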
In conclusion, the REG optimizer represents a significant step forward in robust training dynamics for large language models and beyond. By leveraging the computationally efficient and stable RACS operator, REG offers superior performance, enhanced stability, and better compatibility with existing training paradigms compared to previous structure-aware optimizers. This makes it a promising candidate for future supervised fine-tuning applications. For more details, refer to the full research paper.


