TLDR: A new research paper explains why the Muon optimizer trains Large Language Models (LLMs) faster than Adam. The study attributes Muon’s effectiveness to how it optimizes the LLM’s associative memory components (the Value and Output attention weights and the Feed-Forward Networks). Muon’s update rule promotes a more balanced, ‘isotropic’ learning process, which is particularly beneficial for learning infrequent ‘tail classes’ in heavy-tailed datasets, a common characteristic of real-world data. Both empirical and theoretical analyses indicate that Muon acquires knowledge more uniformly than Adam, which can struggle with imbalanced data.
A recent research paper titled “Muon Outperforms Adam in Tail-End Associative Memory Learning” by Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y. F. Tan sheds light on why the Muon optimizer consistently trains Large Language Models (LLMs) faster than the widely used Adam optimizer. While Muon’s empirical success has been clear, the underlying reasons for its superior performance have remained a mystery until now.
The researchers demystified Muon’s mechanism by looking at how LLMs store and retrieve information, a concept known as associative memory. They specifically investigated which parts of the transformer architecture, the backbone of LLMs, benefit most from Muon’s unique optimization approach.
Unveiling Muon’s Key Beneficiaries
Through careful experiments, the study found that Muon’s rapid convergence in validation loss is primarily due to its impact on two crucial components of LLMs: the Value and Output (VO) attention weights and the Feed-Forward Networks (FFNs). These components are considered the main ‘associative memory stores’ within the model, responsible for holding and recalling learned facts and knowledge. Interestingly, applying Muon only to these VO and FFN blocks almost fully recovered the performance gains seen when Muon was applied to the entire model, suggesting that other parts, like the Query and Key (QK) attention weights, benefit much less.
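To make the "Muon only on VO and FFN" setup concrete, here is a minimal sketch of how parameters might be routed to two optimizers by name. The parameter names and the substring patterns are hypothetical, chosen for illustration only; the paper's actual code and naming are not described in this article.

```python
# Hypothetical name fragments marking the associative-memory blocks:
# Value/Output attention projections and the feed-forward (MLP) weights.
MUON_PATTERNS = ("attn.v_proj", "attn.o_proj", "mlp.")


def split_param_names(names):
    """Split parameter names into a Muon group and an Adam group."""
    muon, adam = [], []
    for name in names:
        is_matrix = name.endswith(".weight")
        if is_matrix and any(pattern in name for pattern in MUON_PATTERNS):
            muon.append(name)   # VO and FFN weight matrices
        else:
            adam.append(name)   # QK weights, norms, embeddings, everything else
    return muon, adam


# Illustrative (made-up) parameter names for a single transformer layer.
names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.attn.k_proj.weight",
    "layers.0.attn.v_proj.weight",
    "layers.0.attn.o_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.down_proj.weight",
    "layers.0.norm.weight",
    "embed.weight",
]
muon_names, adam_names = split_param_names(names)
print("Muon:", muon_names)
print("Adam:", adam_names)
```

In a real training script, the two name lists would be used to build separate parameter groups, one per optimizer.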
This finding is significant because it connects Muon’s success directly to how LLMs learn and store knowledge. An associative memory can be thought of as a collection of ‘facts’, each stored as an outer product of two vectors. Muon’s update rule orthogonalizes the gradient matrix, which roughly amounts to keeping its singular directions while setting their singular values to the same size, so each of these ‘orthogonal facts’ receives equal importance in the update. This is crucial when dealing with real-world data, which often follows a ‘heavy-tailed’ distribution: a few pieces of information (head classes) appear very frequently, while many others (tail classes) are rare.
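The outer-product picture can be illustrated with a small NumPy sketch. It builds a toy gradient as a frequency-weighted sum of outer products of orthonormal key and value vectors, then applies an idealized orthogonalization via SVD. The dimensions, frequencies, and the exact-SVD step are assumptions for illustration; practical Muon implementations approximate this step iteratively and apply it to a momentum buffer rather than the raw gradient.

```python
import numpy as np


def orthogonalize(grad):
    """Idealized Muon-style step: keep singular vectors, set nonzero singular values to 1."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    keep = s > 1e-10 * s.max()          # drop numerically zero directions
    return u[:, keep] @ vt[keep, :]


rng = np.random.default_rng(0)
d, k = 16, 4                             # toy embedding size and number of facts

# Orthonormal key and value vectors (columns), one pair per fact.
keys = np.linalg.qr(rng.standard_normal((d, k)))[0]
values = np.linalg.qr(rng.standard_normal((d, k)))[0]

# Heavy-tailed fact frequencies: one head fact, several tail facts (made-up numbers).
freq = np.array([0.7, 0.2, 0.08, 0.02])

# Toy gradient: each fact contributes an outer product weighted by its frequency.
grad = sum(p * np.outer(values[:, i], keys[:, i]) for i, p in enumerate(freq))
update = orthogonalize(grad)

# How strongly each fact is pushed by the raw gradient and by the orthogonalized update.
grad_strength = np.array([values[:, i] @ grad @ keys[:, i] for i in range(k)])
update_strength = np.array([values[:, i] @ update @ keys[:, i] for i in range(k)])

print("raw gradient:   ", np.round(grad_strength, 3))
print("orthogonalized: ", np.round(update_strength, 3))
```

In this toy setup the raw gradient pushes each fact in proportion to its frequency, while the orthogonalized update pushes every fact with the same strength (up to floating-point error), which is the sense in which tail facts are not neglected.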
Balanced Learning for Heavy-Tailed Data
The paper explains Muon’s superiority through two key properties. First, its update rule consistently leads to a more ‘isotropic’ singular spectrum compared to Adam. In simpler terms, this means Muon helps the model distribute its learning capacity more evenly across different directions, preventing it from over-focusing on dominant patterns. Second, as a direct result of this isotropic learning, Muon optimizes tail classes much more effectively than Adam when dealing with heavy-tailed data.
Empirical evidence from the study supports these claims. When analyzing the singular value spectra of weight matrices, Muon consistently produced more isotropic representations, indicating that knowledge, regardless of its frequency, was represented with comparable strength. Furthermore, in a knowledge-intensive question-answering task designed with heavy-tailed data, Muon matched Adam’s strong performance on frequent (head) classes but significantly outperformed Adam on rare (tail) classes. This led to faster and more uniform learning across all data frequencies, effectively narrowing the performance gap between head and tail knowledge.
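One way to put a number on how ‘isotropic’ a singular spectrum is: compute the entropy of the normalized singular values and divide by its maximum possible value. The sketch below uses that measure as an illustrative assumption; it is not necessarily the metric used in the paper.

```python
import numpy as np


def normalized_svd_entropy(weight):
    """Return a score in [0, 1]; 1 means all singular values are equal.

    Assumes `weight` is a nonzero 2-D array with at least two singular values.
    """
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / s.sum()                      # treat the spectrum as a distribution
    p = p[p > 0]                         # avoid log(0) for exactly-zero values
    return float(-(p * np.log(p)).sum() / np.log(len(s)))


isotropic = np.eye(8)                    # all singular values equal
rank_one = np.ones((8, 8))               # a single dominant direction

iso_score = normalized_svd_entropy(isotropic)
low_score = normalized_svd_entropy(rank_one)
print(iso_score, low_score)
```

A score near 1 means the matrix spreads its weight evenly across directions, while a score near 0 means a few directions dominate.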
Theoretical Confirmation
Beyond empirical observations, the researchers provided theoretical backing for their findings. By analyzing a simplified one-layer associative memory model under class-imbalanced data, they proved that Muon consistently achieves balanced learning across different classes, regardless of how features are embedded. In stark contrast, Adam’s learning performance was shown to be unstable and highly dependent on the properties of these embeddings, potentially leading to large disparities in learning errors between classes.
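The dependence on embeddings can be seen in a toy experiment. The sketch below takes one step from a zero weight matrix in a linear associative memory with imbalanced classes and compares an idealized Muon direction (SVD-based orthogonalization) with a sign-based direction, which is what Adam reduces to when momentum and epsilon are removed. The loss, dimensions, class frequencies, and both simplifications are assumptions for illustration; this is not the paper's model or proof.

```python
import numpy as np


def orthogonalize(mat):
    """Idealized Muon direction: keep singular vectors, set nonzero singular values to 1."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    keep = s > 1e-10 * s.max()
    return u[:, keep] @ vt[keep, :]


def per_class_progress(keys, values, freq):
    """One step from W = 0; return each class's score v_i^T W k_i under both rules."""
    # Descent direction for the toy loss -sum_i freq_i * v_i^T W k_i.
    direction = (values * freq) @ keys.T
    muon_step = orthogonalize(direction)
    sign_step = np.sign(direction)       # Adam without momentum or epsilon

    def score(w):
        return np.einsum("di,de,ei->i", values, w, keys)

    return score(muon_step), score(sign_step)


rng = np.random.default_rng(0)
d, k = 16, 4
freq = np.array([0.7, 0.2, 0.08, 0.02])  # made-up heavy-tailed class frequencies

# Case 1: embeddings aligned with the coordinate axes.
axis_keys = np.eye(d)[:, :k]
axis_values = np.eye(d)[:, :k]
muon_axis, sign_axis = per_class_progress(axis_keys, axis_values, freq)

# Case 2: the same structure under random rotations of the embeddings.
rot_keys = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :k]
rot_values = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :k]
muon_rot, sign_rot = per_class_progress(rot_keys, rot_values, freq)

print("axis-aligned  muon:", np.round(muon_axis, 3), " sign:", np.round(sign_axis, 3))
print("rotated       muon:", np.round(muon_rot, 3), " sign:", np.round(sign_rot, 3))
```

The orthogonalized step gives every class the same score in both cases, whereas the sign-based step is balanced only when the embeddings happen to line up with the coordinate axes.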
In essence, the research concludes that Muon’s core advantage lies in its update rule, which naturally aligns with the outer-product structure of linear associative memories. This alignment enables Muon to learn tail classes in heavy-tailed distributions more effectively and in a more balanced way than Adam. This deeper understanding of Muon’s mechanism could pave the way for even more efficient and robust LLM training in the future. For more details, see the full research paper, “Muon Outperforms Adam in Tail-End Associative Memory Learning”.


