TLDR: This paper extends the Strong Lottery Ticket Hypothesis (SLTH) to Multi-Head Attention (MHA) mechanisms and Transformers. It proves that randomly initialized MHAs and Transformers (without normalization layers) contain high-performing subnetworks (strong lottery tickets) if their hidden dimensions are sufficiently large. The key insight involves reinterpreting the attention mechanism’s inner product to apply a ‘two-layers-for-one approximation’. Empirical validation confirms that approximation error decreases exponentially with hidden dimension size and is independent of input length. The theory also leads to a new weight initialization strategy that improves strong lottery ticket performance in practical Transformer models like GPT-2.
A new research paper delves into the fascinating concept of “strong lottery tickets” within the complex architecture of modern AI models, specifically focusing on the Multi-Head Attention (MHA) mechanisms found in Transformers. These powerful models are the backbone of many advanced language AI systems today.
The core idea, known as the Strong Lottery Ticket Hypothesis (SLTH), suggests that even in large, randomly initialized neural networks, there exist smaller, high-performing subnetworks—dubbed “strong lottery tickets”—that can achieve comparable accuracy to a fully trained, larger network, even without any additional training. This concept is incredibly appealing because it hints at the possibility of more compact and efficient AI models.
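To make the idea concrete, here is a minimal sketch (not taken from the paper) of what a strong lottery ticket means operationally: the subnetwork is just a binary mask applied to frozen, randomly initialized weights, and only the mask is searched for — the weights themselves are never updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# A frozen, randomly initialized layer: these weights are never trained.
W_random = rng.standard_normal((256, 128))

# A strong lottery ticket is a binary mask over those weights; only the
# mask is searched for (here it is just a random placeholder).
mask = rng.random(W_random.shape) < 0.5

def subnetwork_forward(x):
    # Forward pass through the pruned subnetwork: masked random weights.
    return x @ (W_random * mask).T

x = rng.standard_normal((4, 128))
print(subnetwork_forward(x).shape)  # (4, 256)
```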
While the SLTH has been theoretically proven for various neural network types, its application to Transformers, and in particular to their Multi-Head Attention components, remained an open question. The unique structure of attention mechanisms, which involves inner products between query and key vectors, posed a challenge that existing SLTH proof techniques could not directly handle.
The authors of this paper, Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, and Masato Motomura, set out to bridge this theoretical gap. Their work introduces a novel analysis demonstrating the existence of strong lottery tickets within MHAs. They prove that if a randomly initialized MHA has a sufficiently large “hidden dimension” for its key and value components, then with high probability it contains a strong lottery ticket that can accurately mimic any arbitrary MHA of the same input dimension.
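To make the structure concrete, here is a minimal single-head sketch (with illustrative dimensions and random placeholder masks, not the paper's construction) showing where the pruning masks live and what the “hidden dimension” refers to — the width of the query/key/value projections:

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, d_hidden, seq_len = 64, 256, 10  # d_hidden: key/value hidden dimension

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialized (frozen) projections of a single attention head.
W_Q = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_model)
W_V = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_model)
W_O = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_hidden)

# The claimed strong lottery ticket is a set of binary masks over these
# projections; the theorem says good masks exist when d_hidden is large
# enough (the masks below are random placeholders, not optimized ones).
masks = {name: (rng.random(W.shape) < 0.5)
         for name, W in [("Q", W_Q), ("K", W_K), ("V", W_V), ("O", W_O)]}

def pruned_attention(X):
    Q = X @ (W_Q * masks["Q"]).T
    K = X @ (W_K * masks["K"]).T
    V = X @ (W_V * masks["V"]).T
    A = softmax(Q @ K.T / np.sqrt(d_hidden))
    return A @ V @ (W_O * masks["O"]).T

X = rng.standard_normal((seq_len, d_model))
print(pruned_attention(X).shape)  # (seq_len, d_model)
```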
The key to their breakthrough lies in reinterpreting the inner product between query and key vectors within the attention mechanism. They view this as a linear neural network, allowing them to apply a variation of a foundational argument in SLTH theory called the “two-layers-for-one approximation.” This technique essentially shows how a single target layer can be approximated by pruning two randomly initialized layers. What’s particularly interesting is that their approach for MHAs doesn’t require additional layers for approximation, unlike previous applications of this argument to fully-connected networks.
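The flavor of that argument can be seen in a toy example (a sketch under simplified assumptions, not the paper's proof): a single target weight is matched by keeping a subset of the “paths” through two random layers — products of one weight from each layer — and pruning the rest. A greedy subset selection stands in here for the subset-sum argument used in SLTH proofs.

```python
import numpy as np

rng = np.random.default_rng(2)

def two_for_one_scalar(target, hidden_dim, rng):
    """Approximate one target weight by pruning two random layers.

    The target scalar is matched by keeping a subset of the products
    v_i * u_i (paths through a width-hidden_dim linear network) and
    pruning the rest; a greedy selection is used purely for illustration.
    """
    u = rng.standard_normal(hidden_dim)
    v = rng.standard_normal(hidden_dim)
    products = u * v
    keep = np.zeros(hidden_dim, dtype=bool)
    approx = 0.0
    for i in np.argsort(-np.abs(products)):
        if abs(target - (approx + products[i])) < abs(target - approx):
            keep[i] = True
            approx += products[i]
    return approx, keep

target = 0.7
approx, keep = two_for_one_scalar(target, hidden_dim=256, rng=rng)
print(f"target={target:.4f} approx={approx:.4f} kept={keep.sum()} of 256 paths")
```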
Building on this understanding for MHAs, the researchers further extended the SLTH to entire Transformer architectures, specifically those without normalization layers. This means that a randomly initialized Transformer, under certain conditions, also contains a strong lottery ticket capable of approximating an arbitrary Transformer with similar structural properties.
Empirical Validation and Practical Insights
The theoretical findings were not just abstract proofs; they were also empirically validated through experiments. The researchers observed that the approximation error between the strong lottery ticket within a source model (MHA or Transformer) and its target counterpart decreased exponentially as the hidden dimension of the source model increased. This directly supports their theoretical claim that larger hidden dimensions are crucial for the existence of these efficient subnetworks.
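The direction of that trend can be reproduced with the same toy subset-sum construction as above (again, an illustration rather than the paper's experiment): as the width of the random construction grows, the achievable approximation error shrinks rapidly.

```python
import numpy as np

rng = np.random.default_rng(3)

def greedy_error(target, hidden_dim, rng):
    # Toy stand-in for finding a ticket of width hidden_dim: greedily keep
    # products of two random layers that bring the sum closer to the target.
    products = rng.standard_normal(hidden_dim) * rng.standard_normal(hidden_dim)
    approx = 0.0
    for p in sorted(products, key=abs, reverse=True):
        if abs(target - (approx + p)) < abs(target - approx):
            approx += p
    return abs(target - approx)

# The paper's bound says the error shrinks exponentially in the hidden
# dimension; this greedy toy only illustrates the direction of the trend.
for d in [8, 16, 32, 64, 128, 256]:
    errors = [greedy_error(0.7, d, rng) for _ in range(100)]
    print(f"hidden_dim={d:4d}  mean |error| = {np.mean(errors):.2e}")
```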
Another significant finding was that the approximation error remained stable and did not increase even when the input sequence length grew. This is important for Transformer models, which often process long sequences of data.
Perhaps one of the most exciting practical implications of this research is a new weight initialization scheme. The theory suggested a non-conventional initialization where query and key projection weights are scaled differently. When tested with GPT-2 models on a language modeling task, this scaled initialization strategy led to better-performing strong lottery tickets, achieving lower validation loss and approaching the performance of fully trained models. This suggests that their theoretical insights can directly inform and improve the way we find efficient subnetworks in real-world AI applications.
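The article does not state the exact scaling factors, so the snippet below only sketches the general shape of such a scheme (all concrete values here are hypothetical): query and key projections are drawn from the same family of distributions but with deliberately different standard deviations, instead of the usual symmetric initialization.

```python
import numpy as np

rng = np.random.default_rng(4)

d_model, d_head = 768, 64     # GPT-2-small-like sizes (illustrative)
base_std = 0.02               # GPT-2's customary init scale

# Conventional initialization: query and key projections share one scale.
W_Q = rng.normal(0.0, base_std, size=(d_head, d_model))
W_K = rng.normal(0.0, base_std, size=(d_head, d_model))

# Scaled initialization (shape only; the actual factors in the paper may
# differ): query and key projections deliberately use different scales.
std_q, std_k = 0.08, 0.005    # hypothetical, asymmetric standard deviations
W_Q_scaled = rng.normal(0.0, std_q, size=(d_head, d_model))
W_K_scaled = rng.normal(0.0, std_k, size=(d_head, d_model))
```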
This research significantly advances our understanding of overparameterized models and the Strong Lottery Ticket Hypothesis, extending its theoretical foundation to the critical Multi-Head Attention mechanisms and Transformer architectures. It opens new avenues for developing more compact and efficient AI models. You can read the full paper here.


