TLDR: A new research paper introduces the ‘isotropic curvature model’ to analyze deep learning optimization, particularly the Muon optimizer. It finds that leveraging the matrix structure of weights is beneficial, and the optimal update involves making gradient singular values more homogeneous. Full gradient orthogonalization, as used in Muon, is shown to be optimal only under extreme curvature growth conditions, suggesting that while Muon is directionally correct, it may not be strictly optimal in all practical scenarios.
For years, the world of deep learning optimization was dominated by Adam, an algorithm so prevalent it became almost synonymous with training neural networks. Introduced in 2014, Adam held its ground against numerous challengers, even receiving a prestigious Test of Time Award in early 2025. However, this long-standing reign was challenged in late 2024 by a new contender: Muon.
Muon quickly demonstrated superior performance, initially on smaller language models, and then impressively scaled to large, industry-grade models. By February 2025, it was shown to require significantly fewer computational resources (FLOPs) than AdamW to achieve comparable performance on a 16-billion-parameter language model. This rapid ascent suggests Muon is becoming the new standard for training language models, a remarkable feat in less than a year.
What makes Muon so effective? Its core insight is intuitive yet mathematically deep: deep learning weights are often structured as matrices (e.g., in feedforward layers or attention mechanisms). Unlike Adam, which treats all weights as a single long vector, Muon acknowledges and leverages this matrix structure. When updating weights, Muon doesn’t follow the raw gradient direction. Instead, it uses an “orthogonalized gradient.” Conceptually, this means taking the Singular Value Decomposition (SVD) of the gradient matrix, G = U Σ V^T, and using only the orthogonal factors, U V^T, for the update. The singular values in Σ (the magnitudes) are discarded, which is equivalent to setting them all to one.
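The orthogonalization step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Muon's actual implementation: the function name `orthogonalize_gradient` and the random test matrix are assumptions made for the example, and an exact SVD is used purely for clarity.

```python
import numpy as np

def orthogonalize_gradient(grad):
    """Return U @ V^T from the reduced SVD of grad.

    The singular vectors are kept and every singular value is replaced by 1.
    (Hypothetical helper for illustration only.)
    """
    u, _singular_values, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

# Illustrative 4x3 "gradient" matrix; real gradients come from backpropagation.
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))
update = orthogonalize_gradient(grad)

print("gradient singular values:", np.linalg.svd(grad, compute_uv=False))
print("update singular values:  ", np.linalg.svd(update, compute_uv=False))
```

The update has the same shape and the same singular vectors as the gradient, but its singular values are all equal, so every singular direction receives the same step size.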
The immediate question is: why discard the singular values? A larger singular value typically indicates a more promising direction for reducing the loss, so making all singular values equal, which is what discarding them amounts to, seems counter-intuitive. This puzzle, along with the question of why the singular spaces of the original gradient should be preserved, has spurred a wave of research to understand Muon’s underlying mechanisms.
The Isotropic Curvature Model
To shed light on these questions, a new research paper, “Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?” by Weijie Su from the University of Pennsylvania, introduces a novel framework: the isotropic curvature model. The model analyzes deep learning optimization over a single iteration by focusing on the matrix structure of weights and making a key assumption: the curvature of the loss function (including the second-order Hessian and higher-order terms) is ‘isotropic’, or uniform across all perturbation directions. This assumption is motivated by the immense scale of deep learning models, where no single direction is inherently favored, and by the expectation that, when averaged over large batches, the curvature exhibits convergent behavior.
The model simplifies the complex higher-order information of the loss function into a single, increasing “curvature function” (H). By analyzing this simplified yet powerful model, the paper provides several crucial insights into matrix-gradient methods like Muon.
Key Findings: Alignment, Homogenization, and Orthogonalization
First, the isotropic curvature model justifies why the singular spaces of the original gradient matrix should be preserved in the update. This means that the optimal update matrix will share the same ‘directions’ (singular spaces) as the gradient, only modifying their ‘magnitudes’ (singular values).
More importantly, the model offers guidance on how the spectrum (singular values) of the gradient matrix should be modified. Under a general growth condition on the curvature function, the model shows that the optimal update matrix is achieved by making the singular values of the original gradient matrix more homogeneous – that is, making them closer in ratio while preserving their original ordering. This property is termed “spectrum homogenization.” This suggests that while Muon’s orthogonalization (making all singular values equal) is a step in the right direction, it might be an extreme form of this homogenization.
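To make “spectrum homogenization” concrete, here is a minimal NumPy sketch. The power map s → s**power is an assumption chosen only because it visibly shrinks singular-value ratios while preserving their order; it is not the optimal transform derived in the paper, and the name `homogenize_spectrum` is hypothetical.

```python
import numpy as np

def homogenize_spectrum(grad, power=0.5):
    """Keep the gradient's singular vectors but map each singular value s to s**power.

    The power map is an illustrative assumption, not the paper's optimal transform.
    power=1 returns the raw gradient; power=0 returns the orthogonalized gradient
    (for a gradient with no zero singular values).
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # Scaling the columns of u by s**power is the same as u @ diag(s**power).
    return (u * s**power) @ vt

rng = np.random.default_rng(0)
grad = rng.standard_normal((5, 4))
update = homogenize_spectrum(grad, power=0.5)

old_s = np.linalg.svd(grad, compute_uv=False)
new_s = np.linalg.svd(update, compute_uv=False)
print("largest/smallest ratio before:", old_s[0] / old_s[-1])
print("largest/smallest ratio after: ", new_s[0] / new_s[-1])
```

With a power strictly between 0 and 1, the ratio between any two singular values moves closer to one while their ordering is unchanged. The two endpoints recover the raw gradient and Muon-style orthogonalization respectively.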
Finally, the paper investigates when full orthogonalization becomes truly optimal. It proves that if the curvature exhibits a “phase transition” (a sharp increase in its growth rate at a certain point), then setting all singular values to be equal, i.e. orthogonalization, becomes optimal in the asymptotic limit where that increase is extreme. This can be visualized as a “kink” in the curvature function, where its derivative suddenly jumps from a small to a large value.
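A one-dimensional toy calculation gives some intuition for the kink. This is a loose scalar analogue, not the paper's matrix model: the piecewise-linear function `kinked_curvature`, the objective "minus gradient size times step, plus curvature penalty", and all numeric values are assumptions made for illustration.

```python
import numpy as np

def kinked_curvature(step, kink=1.0, slope_before=0.1, slope_after=10.0):
    """Toy curvature penalty whose slope jumps from slope_before to slope_after at the kink."""
    step = np.asarray(step, dtype=float)
    return np.where(
        step <= kink,
        slope_before * step,
        slope_before * kink + slope_after * (step - kink),
    )

def best_step(gradient_size, grid):
    """Grid search for the step minimizing -gradient_size * step + penalty(step)."""
    objective = -gradient_size * grid + kinked_curvature(grid)
    return grid[np.argmin(objective)]

grid = np.linspace(0.0, 3.0, 3001)
gradient_sizes = (0.5, 2.0, 8.0)
steps = [best_step(g, grid) for g in gradient_sizes]

for g, t in zip(gradient_sizes, steps):
    print(f"gradient size {g}: best step {t:.3f}")
```

In this toy setup, every gradient size between the two slopes yields the same best step, located at the kink, even though the gradient sizes differ by a factor of 16. As the slope before the kink shrinks and the slope after it grows, that range covers more and more gradient sizes, which loosely mirrors why equal singular values become optimal only in the extreme limit.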
Implications for Deep Learning Optimizers
These findings suggest that Muon and similar matrix-gradient methods are conceptually sound because they respect the matrix structure of gradients. However, the isotropic curvature model implies that Muon’s full orthogonalization might not be strictly optimal in all practical scenarios. The optimal spectrum transformation is likely not perfectly uniform, as real-world curvature may not exhibit the extreme “kink” behavior required for strict orthogonalization optimality. Instead, a degree of homogenization is generally preferred.
The research also highlights the potential for “auto-preconditioned” methods, where high-order information can be leveraged without explicit computation of the Hessian, relying instead on the inherent structure and isotropic properties of large-scale deep learning systems. This opens avenues for designing new optimizers that first approximate the curvature function and then solve a convex optimization problem to find the optimal update matrix, potentially leading to more fine-grained auto-preconditioners than existing methods like Adam or Muon.
While the isotropic curvature model is currently phenomenological, it provides a powerful theoretical lens for understanding and designing next-generation deep learning optimizers, especially for large language models. Future work will focus on rigorously justifying its assumptions, connecting it to other phenomena in deep learning, and developing efficient, practical implementations for GPU environments.


