TLDR: This research paper analyzes how different normalization schemes (Post-LN, Pre-LN, Peri-LN, and others) shape token representations in deep transformers. By modeling token evolution as interacting particles on a sphere, the authors show that normalization acts as a “speed regulator.” They find that Peri-LN and nGPT allow faster initial token movement, while Pre-LN, Mix-LN, and nGPT resist “representation collapse” in deeper layers more effectively than Post-LN. Peri-LN is highlighted as a particularly effective choice for balancing early-layer and deep-layer utility.
Deep learning models, particularly transformers, have revolutionized fields from natural language processing to protein folding. At the heart of their remarkable capabilities lies the attention mechanism, but a less-talked-about component, layer normalization (LayerNorm), plays a critical role in shaping how these models process information across their many layers.
A new research paper, “NORMALIZATION IN ATTENTION DYNAMICS,” delves into the intricate effects of various normalization schemes on the internal representations of tokens within deep transformers. The authors, Nikita Karagodin, Shu Ge, Yury Polyanskiy, and Philippe Rigollet, propose a novel perspective: viewing the evolution of token representations as interacting particles moving on a sphere. From this vantage point, normalization schemes are reinterpreted as a form of “speed regulation” for these particles.
Understanding the Dynamics of Normalization
The study provides a unified framework to analyze several prominent normalization schemes, including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, and LN-Scaling. Each of these schemes, while differing only subtly in implementation, profoundly influences how quickly token representations cluster and how severely the model suffers from “representation collapse”—a phenomenon where the deep layers of large language models (LLMs) become near-identity transformations, losing their ability to meaningfully transform data.
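Representation collapse can be observed directly by measuring how similar token vectors become from layer to layer. Below is a minimal sketch (not from the paper) for doing so; the `layer_outputs` list referenced in the usage comment is a hypothetical collection of per-layer hidden states from any transformer’s forward pass.

```python
import numpy as np

def collapse_score(hidden_states):
    """Average pairwise cosine similarity of the token vectors at one layer.

    A score near 1.0 means the tokens point in nearly the same direction,
    i.e. that layer's representations have collapsed.
    hidden_states: array of shape (num_tokens, dim).
    """
    unit = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    sims = unit @ unit.T                          # cosine similarity matrix
    n = sims.shape[0]
    return sims[~np.eye(n, dtype=bool)].mean()    # ignore self-similarity

# Hypothetical usage: layer_outputs[l] holds the (num_tokens, dim) hidden
# states after layer l, collected from any transformer's forward pass.
# scores = [collapse_score(h) for h in layer_outputs]
# A curve that rises toward 1.0 in the deep layers signals collapse.
```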
For instance, Post-LayerNorm (Post-LN) has been a standard for theoretical analysis, constraining particles to evolve on a sphere. However, Pre-LayerNorm (Pre-LN) has emerged as the default for leading LLMs like GPT and LLaMA, known for enabling more stable training of deeper networks and reducing sensitivity to hyperparameters. Other innovative approaches include Mix-LN, which combines Post-LN in early layers with Pre-LN in deeper ones, and Peri-LN, a refinement of Mix-LN reportedly used in models like Gemma-3. LN-Scaling and nGPT offer further variations, each with unique implications for token dynamics.
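To make the placement differences concrete, here is a schematic PyTorch sketch of how a single sublayer is wired under three of these schemes. The `sublayer` argument stands in for attention (or an MLP), and the code reflects the commonly cited definitions of each scheme rather than the paper’s or any specific model’s implementation.

```python
import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Post-LN (original Transformer): normalize after the residual addition,
    # so the residual stream itself is rescaled at every layer.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Pre-LN (GPT/LLaMA-style): normalize only the branch input; the residual
    # stream is left untouched, so its norm can grow with depth.
    return x + sublayer(norm(x))

def peri_ln(x, sublayer, norm_in, norm_out):
    # Peri-LN: normalize both the branch input and the branch output before
    # adding the branch back to the residual stream.
    return x + norm_out(sublayer(norm_in(x)))

# Toy usage with a linear layer as a placeholder sublayer.
d = 64
x = torch.randn(8, d)                                  # 8 tokens of width d
y = peri_ln(x, nn.Linear(d, d), nn.LayerNorm(d), nn.LayerNorm(d))
```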
Speed Regulation and Representation Collapse
The paper’s core insight is that by focusing on the “direction” of token representations, all normalization rules can be seen as interacting particle systems on a sphere, sharing a common velocity field but subject to distinct, rule-dependent speed-regulation mechanisms. This model, despite its simplicity, effectively captures complex behaviors observed in practice, such as the “curse of depth” and representation collapse.
The researchers analyze both the initial and terminal velocities of tokens, which determine how effectively each layer contributes to shaping the final representation. An efficient architecture should ensure that early layers perform significant transformations while also preventing tokens from collapsing too quickly in deeper layers. The study reveals that Peri-LN and nGPT (with specific parameter choices) allow tokens to move faster in early layers, making better use of initial processing. Conversely, Pre-LN, Mix-LN, and nGPT (with constant alpha) exhibit a polynomial slowdown in terminal velocity: tokens cluster more gradually, so these schemes are more resistant to representation collapse in very deep models than Post-LN, which clusters tokens much more aggressively.
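To make the “speed regulation” picture concrete, the following toy simulation (an illustration under simplified assumptions, not the paper’s exact equations) places tokens on the unit sphere, moves them along an attention-like velocity field, and renormalizes after each step. The scalar `speed` plays the role that a normalization scheme regulates: a constant speed drives rapid clustering, while a speed that decays with depth clusters far more gradually.

```python
import numpy as np

def attention_velocity(x, beta=4.0):
    """Mean-field attention-style velocity: each token is pulled toward a
    softmax-weighted average of all tokens (a common toy model; the paper's
    exact velocity field may differ)."""
    logits = beta * (x @ x.T)                        # pairwise alignment scores
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def step_on_sphere(x, speed=0.1):
    """One 'layer': move along the velocity field, then renormalize so the
    tokens stay on the unit sphere."""
    x_new = x + speed * attention_velocity(x)
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # 16 tokens on the sphere

for layer in range(64):
    # Constant speed clusters aggressively (Post-LN-like); try
    # speed=0.1 / (layer + 1) for a polynomially slowing, Pre-LN-like profile.
    x = step_on_sphere(x, speed=0.1)

cos = (x @ x.T)[~np.eye(16, dtype=bool)].mean()
print(f"mean pairwise cosine after 64 layers: {cos:.3f}")  # near 1.0 => collapse
```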
Ultimately, the research identifies Peri-LN as a particularly effective scheme, demonstrating a strong balance by facilitating substantial token movement in early layers while also mitigating representation collapse in deeper layers. The nGPT scheme also offers similar benefits, with the added advantage of trainable parameters to control its behavior.
While the study offers a powerful theoretical lens, the authors acknowledge limitations, such as simplifying assumptions about weight matrices and the omission of MLP layers. Future work aims to address these complexities, including a companion paper on gradient-flow analysis. This research provides a principled basis for comparing normalization schemes and offers concrete guidelines for designing more effective transformer architectures. You can read the full paper here: NORMALIZATION IN ATTENTION DYNAMICS.


