
Understanding Transformer Stability: A Deep Dive into Layer Normalization Placement

TL;DR: This paper theoretically analyzes how different Layer Normalization (LN) placements affect Transformer training stability. It shows that Pre-LN leads to unbounded hidden state growth and exploding gradients, while Peri-LN ensures controlled growth and stable gradients. The research introduces a residual step scaling method that further enhances stability and performance, validated by experiments on GPT-2 models.

Transformers have become the backbone of modern deep learning, powering advanced models in areas like language and vision. However, training these deep models can be notoriously unstable. A crucial component, Layer Normalization (LN), is widely used to improve stability, but its optimal placement within the Transformer architecture has largely been determined through trial and error.

A recent research paper, “Stability of Transformers under Layer Normalization”, by Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, and Markos A. Katsoulakis, delves into a principled study of how different LN placements affect the stability of Transformers during both the forward pass (hidden states) and backward pass (gradients).

The Challenge of Transformer Stability

The stability of Transformers is critical for effective training. Instabilities can manifest in two main ways: in the forward evaluation, where hidden states can grow uncontrollably, and in gradient backpropagation, where gradients can explode or vanish, hindering learning. The placement of Layer Normalization plays a significant role in mitigating these issues.

Historically, early Transformer designs used Post-LN, applying normalization after the residual addition. While simple, this often led to suboptimal performance and required careful learning-rate warmup and tuning to train reliably. Pre-LN, which places LN at the input of the attention and feedforward modules, became standard due to improved performance. However, Pre-LN is known to produce excessively large hidden states, potentially leading to numerical instability.

More recently, Peri-LN emerged as an alternative, applying LN at both the input and output of modules. Empirical studies have shown Peri-LN to offer improved training stability and more regular hidden states, but a strong theoretical understanding has been lacking.
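The three placements differ only in where normalization sits relative to the residual addition. A minimal NumPy sketch can make this concrete; here `sublayer` is a toy stand-in for an attention or feedforward module, and the LN omits the learnable gain and bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean and unit variance (no learnable params).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer(x, W):
    # Toy stand-in for an attention or feedforward module.
    return np.tanh(x @ W)

def post_ln_block(x, W):
    # Post-LN: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x, W))

def pre_ln_block(x, W):
    # Pre-LN: normalize the module input; the residual path is untouched.
    return x + sublayer(layer_norm(x), W)

def peri_ln_block(x, W):
    # Peri-LN: normalize both the module input and the module output.
    return x + layer_norm(sublayer(layer_norm(x), W))
```

Note how only Post-LN normalizes the residual stream itself; Pre-LN and Peri-LN leave the residual path as a plain sum, which is exactly why the growth of the hidden states along that path matters.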

Theoretical Insights into Layer Normalization

The authors introduce a novel theoretical framework based on optimal control theory, using a continuous-time formulation of Transformer architectures. This approach allows them to analyze the properties of *trained* models, rather than just models at initialization, providing insights into whether training leads to stable or pathological behaviors.

Pre-LN: Unbounded Growth

The paper theoretically demonstrates that under Pre-LN, the hidden states of the model are generally unbounded in magnitude. This means that as the model processes information through its layers, the internal representations can grow arbitrarily large. This unbounded growth can severely degrade the quality of learned representations and, critically, lead to numerical instability during training, including exploding gradients. Even with standard practices like weight decay, the growth remains exponential, which is still problematic for very deep models.
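As a toy illustration of this tendency (not the paper's analysis itself), stacking Pre-LN-style residual updates with random weights already shows the hidden-state norm climbing steadily with depth, since each block adds an increment of roughly fixed size to an unnormalized residual stream:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

d, depth = 64, 48
x = rng.standard_normal(d)
norms = []
for _ in range(depth):
    W = rng.standard_normal((d, d)) / np.sqrt(d)  # assumed toy weights
    x = x + layer_norm(x) @ W  # Pre-LN residual update: nothing caps the sum
    norms.append(np.linalg.norm(x))
```

At random initialization the growth here is gradual; the paper's point is that in trained models nothing prevents it from becoming far worse, since the residual stream is never renormalized.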

Peri-LN: Controlled Growth and Stability

In stark contrast, Peri-LN offers a more controlled dynamic. The output normalization in Peri-LN restricts the magnitude of the internal computations, ensuring that the hidden states remain well-conditioned. The theory shows that for Peri-LN, hidden state magnitudes grow at most linearly with depth, and their variance grows at most quadratically. This controlled growth is vital for preserving information quality across many layers in deep networks.
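The linear bound has a simple intuition: each Peri-LN increment passes through an output LN, so its norm is at most roughly √d, and by the triangle inequality the hidden state after L blocks can exceed its starting norm by at most L·√d. A small sketch under that assumption (toy weights, no learnable LN parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

d, depth = 64, 48
x0 = rng.standard_normal(d)
x = x0.copy()
for _ in range(depth):
    W = rng.standard_normal((d, d)) / np.sqrt(d)  # assumed toy weights
    # Peri-LN update: the output LN caps each increment at norm ~ sqrt(d).
    x = x + layer_norm(layer_norm(x) @ W)
```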

Backward Stability: Preventing Gradient Explosions

Beyond hidden state growth, the paper also examines backward stability, which concerns how gradients propagate through the network. Gradient explosions can effectively halt training.

For Pre-LN, the analysis reveals that the local sensitivity of each block (how much the output changes with respect to the input) grows proportionally with the activations. When activations are large, this leads to exploding gradients for earlier layers, making training unstable.

Peri-LN, however, ensures stable backpropagation. The local sensitivity under Peri-LN is invariant to the magnitude of the activations. This means that even if activations become large, the gradients remain controlled, preventing the compounding effect that leads to explosions in deep networks.
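The invariance claim rests on a basic property of layer normalization: scaling its input by any positive constant leaves the output essentially unchanged, so a path that ends in an LN cannot amplify with activation magnitude. A quick check:

```python
import numpy as np

def layer_norm(x, eps=1e-8):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

x = np.random.default_rng(0).standard_normal(16)
# Scaling the input by a positive constant leaves the LN output (essentially)
# unchanged: the mean and std scale by the same factor and cancel.
out_small = layer_norm(x)
out_large = layer_norm(1000.0 * x)
```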

Enhancing Stability with Scaled Residual Steps

Guided by their theoretical findings, the researchers propose a simple yet effective architectural modification: scaling the residual steps within Transformer blocks. By introducing a step size (Δt < 1) to the residual updates, both forward and backward stability can be further improved.

This scaling factor explicitly reduces how each sub-layer amplifies differences, leading to sharper bounds on hidden state growth and output uncertainty. For backward stability, it scales down the local sensitivity of each block, significantly mitigating the potential for gradient explosion. This modification comes at no additional computational or memory cost, making it a highly practical technique.
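A hedged sketch of the idea, reusing a toy Peri-LN update with the step size `dt` standing in for the paper's Δt: the same stack of blocks is run with `dt = 1.0` and `dt = 0.1`, and the scaled trajectory drifts far less from its starting magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def peri_ln_step(x, W, dt=1.0):
    # Scaled residual update: x <- x + dt * LN(f(LN(x))), with dt < 1
    # shrinking each block's contribution (and its local sensitivity).
    return x + dt * layer_norm(layer_norm(x) @ W)

d, depth = 64, 48
x0 = rng.standard_normal(d)
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]  # toy weights

x_full, x_scaled = x0.copy(), x0.copy()
for W in Ws:
    x_full = peri_ln_step(x_full, W, dt=1.0)
    x_scaled = peri_ln_step(x_scaled, W, dt=0.1)
```

Since `dt` multiplies an existing term rather than adding one, this matches the paper's observation that the modification costs no extra compute or memory.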

Experimental Validation

The theoretical findings are strongly supported by experimental results using GPT-2 models (GPT-2, GPT-2 Large, and GPT-2 XL) on the OpenWebText dataset. The experiments showed:

  • Peri-LN models consistently remained stable across all trials, even without specific hyperparameter tuning for Peri-LN. In contrast, Pre-LN models often diverged, even with tuned hyperparameters.
  • Combining Peri-LN with scaled residual steps (Δt = 0.1) yielded the best performance across various metrics, validating the improved generalization bounds predicted by the theory.
  • Scaled residual steps significantly improved the stability and performance of Pre-LN models, transforming unstable training into stable and effective learning.
  • The scaling effectively controlled the growth of hidden states, keeping their mean absolute value and variance substantially lower across layers.


A Principled Framework for Future Designs

This research provides a powerful framework for assessing the stability of Transformer architectures. It offers a systematic workflow to theoretically screen new architectural modifications before expensive empirical training. By understanding the fundamental mechanisms of stability, designers can make more informed choices, guiding the development of more robust and efficient deep learning models.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
