TLDR: This research paper analyzes how different normalization schemes (Post-LN, Pre-LN, Peri-LN, and others) shape token representations in deep transformers. By modeling token evolution as interacting particles on a sphere, the authors show that normalization acts as a “speed regulator.” They find that Peri-LN and nGPT allow faster initial token movement, while Pre-LN, Mix-LN, and nGPT resist “representation collapse” in deeper layers more effectively than Post-LN. Peri-LN is highlighted as a particularly effective choice for balancing early-layer and deep-layer utility.
Deep learning models, particularly transformers, have revolutionized fields from natural language processing to protein folding. At the heart of their remarkable capabilities lies the attention mechanism, but a less-talked-about component, layer normalization (LayerNorm), plays a critical role in shaping how these models process information across their many layers.
A new research paper, “NORMALIZATION IN ATTENTION DYNAMICS,” delves into the intricate effects of various normalization schemes on the internal representations of tokens within deep transformers. The authors, Nikita Karagodin, Shu Ge, Yury Polyanskiy, and Philippe Rigollet, propose a novel perspective: viewing the evolution of token representations as interacting particles moving on a sphere. From this vantage point, normalization schemes are reinterpreted as a form of “speed regulation” for these particles.
Understanding the Dynamics of Normalization
The study provides a unified framework to analyze several prominent normalization schemes, including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling. Each of these schemes, while differing only subtly in implementation, profoundly influences how quickly token representations cluster and how severely the model suffers from “representation collapse”—a phenomenon where the deep layers of large language models (LLMs) become near-identity transformations, losing their ability to meaningfully transform data.
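Representation collapse can be observed directly by measuring how similar token vectors become from layer to layer. Below is a minimal sketch (not from the paper) for doing so; the `layer_outputs` list referenced in the usage comment is a hypothetical collection of per-layer hidden states from any transformer’s forward pass.

```python
import numpy as np

def collapse_score(hidden_states):
    """Average pairwise cosine similarity of the token vectors at one layer.

    A score near 1.0 means the tokens point in nearly the same direction,
    i.e. that layer's representations have collapsed.
    hidden_states: array of shape (num_tokens, dim).
    """
    unit = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    sims = unit @ unit.T                          # cosine similarity matrix
    n = sims.shape[0]
    return sims[~np.eye(n, dtype=bool)].mean()    # ignore self-similarity

# Hypothetical usage: layer_outputs[l] holds the (num_tokens, dim) hidden
# states after layer l, collected from any transformer's forward pass.
# scores = [collapse_score(h) for h in layer_outputs]
# A curve that rises toward 1.0 in the deep layers signals collapse.
```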
For instance, Post-LayerNorm (Post-LN) has been a standard for theoretical analysis, constraining particles to evolve on a sphere. However, Pre-LayerNorm (Pre-LN) has emerged as the default for leading LLMs like GPT and LLaMA, known for enabling more stable training of deeper networks and reducing sensitivity to hyperparameters. Other innovative approaches include Mix-LN, which combines Post-LN in early layers with Pre-LN in deeper ones, and Peri-LN, a refinement of Mix-LN reportedly used in models like Gemma-3. LN-Scaling and nGPT offer further variations, each with unique implications for token dynamics.
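To make the placement differences concrete, here is a schematic PyTorch sketch of how a single sublayer is wired under three of these schemes. The `sublayer` argument stands in for attention (or an MLP), and the code reflects the commonly cited definitions of each scheme rather than the paper’s or any specific model’s implementation.

```python
import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Post-LN (original Transformer): normalize after the residual addition,
    # so the residual stream itself is rescaled at every layer.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Pre-LN (GPT/LLaMA-style): normalize only the branch input; the residual
    # stream is left untouched, so its norm can grow with depth.
    return x + sublayer(norm(x))

def peri_ln(x, sublayer, norm_in, norm_out):
    # Peri-LN: normalize both the branch input and the branch output before
    # adding the branch back to the residual stream.
    return x + norm_out(sublayer(norm_in(x)))

# Toy usage with a linear layer as a placeholder sublayer.
d = 64
x = torch.randn(8, d)                                  # 8 tokens of width d
y = peri_ln(x, nn.Linear(d, d), nn.LayerNorm(d), nn.LayerNorm(d))
```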
Speed Regulation and Representation Collapse
The paper’s core insight is that by focusing on the “direction” of token representations, all normalization rules can be seen as interacting particle systems on a sphere, sharing a common velocity field but subject to distinct, rule-dependent speed-regulation mechanisms. This model, despite its simplicity, effectively captures complex behaviors observed in practice, such as the “curse of depth” and representation collapse.
The researchers analyze both the initial and terminal velocities of tokens, which determine how effectively each layer contributes to shaping the final representation. An efficient architecture should ensure that early layers perform significant transformations while also preventing tokens from collapsing too quickly in deeper layers. The study reveals that Peri-LN and nGPT (with specific parameter choices) allow tokens to move faster in early layers, making better use of initial processing. Conversely, Pre-LN, Mix-LN, and nGPT (with constant alpha) exhibit a polynomial slowdown in terminal velocity: tokens cluster more gradually, so these schemes are more resistant to representation collapse in very deep models than Post-LN, which clusters tokens much more aggressively.
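To make the “speed regulation” picture concrete, the following toy simulation (an illustration under simplified assumptions, not the paper’s exact equations) places tokens on the unit sphere, moves them along an attention-like velocity field, and renormalizes after each step. The scalar `speed` plays the role that a normalization scheme regulates: a constant speed drives rapid clustering, while a speed that decays with depth clusters far more gradually.

```python
import numpy as np

def attention_velocity(x, beta=4.0):
    """Mean-field attention-style velocity: each token is pulled toward a
    softmax-weighted average of all tokens (a common toy model; the paper's
    exact velocity field may differ)."""
    logits = beta * (x @ x.T)                        # pairwise alignment scores
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def step_on_sphere(x, speed=0.1):
    """One 'layer': move along the velocity field, then renormalize so the
    tokens stay on the unit sphere."""
    x_new = x + speed * attention_velocity(x)
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # 16 tokens on the sphere

for layer in range(64):
    # Constant speed clusters aggressively (Post-LN-like); try
    # speed=0.1 / (layer + 1) for a polynomially slowing, Pre-LN-like profile.
    x = step_on_sphere(x, speed=0.1)

cos = (x @ x.T)[~np.eye(16, dtype=bool)].mean()
print(f"mean pairwise cosine after 64 layers: {cos:.3f}")  # near 1.0 => collapse
```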
Ultimately, the research identifies Peri-LN as a particularly effective scheme, demonstrating a strong balance by facilitating substantial token movement in early layers while also mitigating representation collapse in deeper layers. The nGPT scheme also offers similar benefits, with the added advantage of trainable parameters to control its behavior.
While the study offers a powerful theoretical lens, the authors acknowledge limitations, such as simplifying assumptions about weight matrices and the omission of MLP layers. Future work aims to address these complexities, including a companion paper on gradient-flow analysis. This research provides a principled basis for comparing normalization schemes and offers concrete guidelines for designing more effective transformer architectures. You can read the full paper here: NORMALIZATION IN ATTENTION DYNAMICS.


