TLDR: GateSkip is a new method that uses learnable gates in LLMs to skip computation for less important tokens at each layer. It achieves up to 15% compute savings while retaining over 90% of baseline accuracy on reasoning tasks, and nearly 50% savings at baseline quality on instruction-tuned models, outperforming prior adaptive compute methods. It’s stable, compatible with other efficiency techniques, and provides insights into transformer information flow.
Large Language Models (LLMs) have revolutionized natural language processing, but their increasing size presents significant challenges for efficient deployment. These models typically apply the same amount of computation to every token at every layer, a process that can be wasteful, especially in environments with limited resources or strict latency requirements. Addressing this, researchers have introduced various adaptive compute methods, but many suffer from instability or require extensive retraining.
A new approach called GateSkip offers a lightweight and stable solution for optimizing LLM inference. Developed by Filipe Laitenberger, Dawid Kopiczko, Cees G.M. Snoek, and Yuki M. Asano, GateSkip introduces a simple residual-stream gating mechanism that allows for token-wise layer skipping in decoder-only LLMs. In other words, not every layer needs to process every token, which leads to significant computational savings.
How GateSkip Works
At its core, GateSkip equips each Attention and MLP (Multi-Layer Perceptron) branch within a transformer layer with a small, learnable gate. This gate, consisting of a linear layer followed by a sigmoid activation, scales the branch’s output before it re-enters the residual stream. During inference, tokens are ranked by their gate values, and those deemed less important are skipped under a predefined per-layer budget. When a token is skipped, its hidden state and key-value cache entries are simply copied from the layer below, avoiding unnecessary computation.
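The mechanism is simple enough to sketch in code. Below is a minimal PyTorch illustration of the idea as described above; the class name GatedBranch, the keep_fraction budget argument, and the exact top-k selection are assumptions for illustration, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class GatedBranch(nn.Module):
    """Sketch of a transformer branch (Attention or MLP) whose output is
    scaled by a learned sigmoid gate before re-entering the residual
    stream. Illustrative only; names and details are assumptions."""

    def __init__(self, branch: nn.Module, hidden_size: int):
        super().__init__()
        self.branch = branch  # simplified: a real attention branch also takes masks / KV cache
        self.gate = nn.Linear(hidden_size, 1)  # one importance score per token

    def forward(self, x: torch.Tensor, keep_fraction: float = 1.0) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        out = self.branch(x)
        score = torch.sigmoid(self.gate(out))  # (batch, seq_len, 1)

        if self.training or keep_fraction >= 1.0:
            # Training: smooth, differentiable gating -- no hard skipping.
            return x + score * out

        # Inference: keep only the top-k tokens under the per-layer budget.
        # Skipped tokens carry the residual stream forward unchanged
        # (the paper also copies their KV-cache entries, omitted here).
        k = max(1, int(keep_fraction * x.size(1)))
        top_idx = score.squeeze(-1).topk(k, dim=1).indices  # (batch, k)
        mask = torch.zeros_like(score)
        mask.scatter_(1, top_idx.unsqueeze(-1), 1.0)
        return x + mask * score * out
```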
One of GateSkip’s key advantages is its stability. Unlike many early-exit or router-based Mixture-of-Depths models that often require extensive retraining and can be unstable, GateSkip’s smooth, differentiable gates can be fine-tuned stably on top of already pretrained models. This design minimally perturbs existing representations and provides fine-grained control over computation at both the token and module levels.
Performance and Efficiency
Experiments with Llama 3.1 models (up to 8B parameters) and Gemma 2 2B models demonstrate GateSkip’s effectiveness. On long-form reasoning tasks, GateSkip achieved up to 15% compute savings while maintaining over 90% of the baseline accuracy. For instruction-tuned models, it even showed accuracy gains at full compute and matched baseline quality with nearly 50% compute savings. This performance is particularly notable in generative settings, where many prior adaptive compute methods tend to struggle.
The researchers also conducted ablation studies to understand the impact of different design choices. They found that “vector-gates” (producing an H-dimensional output, where H is the hidden size, rather than a single scalar) and “per-layer vector-gates” (distinct gates for each module) yielded the best results. Crucially, placing the gate after the module’s output, rather than before it, proved essential for stable and effective learning.
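To make these design choices concrete, here is a short, hedged sketch of the two gate shapes, assuming a hidden size H; the specific value of H is illustrative, and both gates are applied to the module’s output rather than its input, the placement found essential above.

```python
import torch.nn as nn

H = 4096  # illustrative hidden size, roughly Llama-3.1-8B scale

# Scalar gate: a single importance score per token.
scalar_gate = nn.Sequential(nn.Linear(H, 1), nn.Sigmoid())

# Vector gate: an H-dimensional output that rescales each feature of the
# branch output individually. Per the ablations, instantiating a distinct
# vector gate for every Attention/MLP module ("per-layer vector-gates")
# worked best.
vector_gate = nn.Sequential(nn.Linear(H, H), nn.Sigmoid())

# Placement the ablations found essential: gate(module_output), i.e. the
# gate sees the branch's output, not the branch's input.
```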
Compatibility and Insights
GateSkip is designed to be compatible with other efficiency techniques, such as 4-bit quantization, speculative decoding, and structured pruning. When combined with 4-bit quantization, GateSkip maintained its effectiveness, with performance curves closely matching the 32-bit model. It also boosted log-likelihood performance when integrated with self-speculative decoding and remained stronger than pruned baselines when combined with structured pruning.
Beyond efficiency, the learned gates offer valuable insights into how transformers process information. Analysis of gate activations revealed consistent patterns: early layers tend to allocate more computation to beginning-of-sequence (BOS) tokens and punctuation, which appear to act as structural anchors. Deeper layers become more selective, focusing on content-bearing words. Interestingly, in sequences containing “forbidden requests,” tokens like “chemical weapon” and “please” received unusually high importance scores, suggesting that these gates could potentially serve as a tool for interpretability and safety, highlighting policy-relevant textual spans.
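As a rough illustration of how one might surface such patterns, the sketch below assumes a hypothetical gate_scores attribute that each gated layer exposes after a forward pass; both the attribute and the helper function are illustrative, not part of any released API.

```python
import torch

@torch.no_grad()
def print_token_importance(model, tokenizer, text: str, top_n: int = 3):
    """Print the highest-scoring tokens per layer according to the gates.
    Assumes `model.layers[i].gate_scores` holds (batch, seq_len, 1) gate
    values from the last forward pass -- a hypothetical attribute."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    model(ids)  # assumed to populate gate_scores as a side effect
    tokens = tokenizer.convert_ids_to_tokens(ids[0])
    for i, layer in enumerate(model.layers):
        scores = layer.gate_scores[0].squeeze(-1)  # (seq_len,)
        top = scores.topk(min(top_n, scores.numel())).indices.tolist()
        print(f"layer {i:2d}: {[tokens[t] for t in top]}")
```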
GateSkip represents a significant step forward in making large language models more efficient and deployable, offering substantial compute savings without sacrificing performance or stability. For more technical details, see the full research paper.


