TLDR: A new research paper introduces REAP, a pruning method for Sparsely-activated Mixture-of-Experts (SMoE) models that significantly outperforms expert merging on generative tasks such as code generation and creative writing. The paper proves that merging introduces an “irreducible error” by destroying the router’s independent control over experts, leading to a “functional subspace collapse.” REAP, which weighs router gate-values together with expert activation norms, achieves near-lossless compression, making large SMoE models more memory-efficient without sacrificing performance on real-world generative applications.
Large Language Models (LLMs) are becoming increasingly powerful, with many leveraging a sophisticated architecture known as Sparsely-activated Mixture-of-Experts (SMoE). These models offer benefits like efficient pre-training and lower latency, but their massive parameter counts lead to significant memory demands. This challenge has spurred research into methods for compressing these experts without compromising performance.
Historically, expert compression efforts have explored quantization, low-rank compression, and two primary expert-reduction strategies: expert pruning and expert merging. Recent findings, particularly on discriminative benchmarks like perplexity and multiple-choice question answering, have often favored expert merging. However, a new research paper titled “REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression,” authored by Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, and Vithursan Thangarasa, challenges this notion, especially for generative tasks.
The paper argues that for generative tasks – such as code generation, creative writing, and mathematical reasoning – expert pruning is the superior strategy. At the core of their argument is a phenomenon they call “functional subspace collapse,” which expert merging induces. They prove that merging introduces an irreducible error because it removes the router’s independent, input-dependent control over individual experts. Once experts are merged, the model loses the ability to dynamically mix and modulate the outputs of the original, distinct experts based on the input, forcing it to approximate a dynamic target with a static one.
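To make this concrete, here is the intuition written out as a short derivation. The notation is ours, not the paper’s exact formulation:

```latex
% An SMoE layer mixes expert outputs with input-dependent gates:
y(x) = \sum_{i \in \mathrm{TopK}(x)} g_i(x)\, E_i(x)

% Merging experts $j$ and $k$ into a single expert $E_m$ collapses
% two independently gated terms into one:
\hat{y}(x) = \hat{g}(x)\, E_m(x)
  \quad \text{in place of} \quad
  g_j(x)\, E_j(x) + g_k(x)\, E_k(x)

% Because $E_m$ is fixed while the target mixture varies with $x$,
% the best achievable reconstruction error
\min_{E_m}\; \mathbb{E}_x \left\| g_j(x)\, E_j(x) + g_k(x)\, E_k(x)
  - \hat{g}(x)\, E_m(x) \right\|^2
% stays strictly positive whenever $g_j$ and $g_k$ vary independently
% across inputs: a static expert cannot track a dynamic target.
```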
In contrast, expert pruning, which involves removing entire experts, preserves the router’s independent control over the remaining experts. This distinction is crucial for maintaining the model’s functional output space and its ability to generate diverse and coherent responses.
Introducing REAP: A Novel Pruning Criterion
Leveraging this insight, the researchers propose Router-weighted Expert Activation Pruning (REAP). This novel pruning criterion considers both the router’s gate-values (which determine how much an expert is activated) and the expert activation norms (the magnitude of an expert’s output). By combining these factors, REAP identifies and prunes experts that contribute minimally to the layer’s output, ensuring that the most impactful experts are retained.
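As a rough illustration, here is a minimal sketch of how such a saliency score could be computed over a calibration set and used to select experts. The function names, tensor shapes, and exact averaging are our assumptions, not the paper’s reference implementation:

```python
import torch

def reap_saliency(gate_values: torch.Tensor, expert_outputs: torch.Tensor) -> float:
    """Score one expert from calibration-set activations.

    gate_values:    (T,) router gate for this expert on the T tokens
                    routed to it.
    expert_outputs: (T, d) the expert's outputs for those tokens.

    Saliency = mean over routed tokens of gate * ||output||_2, i.e. how
    much the expert actually contributes to the layer output. (The exact
    normalization here is our assumption; see the paper for details.)
    """
    contribution = gate_values * expert_outputs.norm(dim=-1)  # (T,)
    return contribution.mean().item()

def select_experts(saliency_scores: list[float], keep_ratio: float = 0.5) -> list[int]:
    """Return indices of the top-scoring experts to retain; the rest are pruned."""
    scores = torch.tensor(saliency_scores)
    k = max(1, int(len(scores) * keep_ratio))
    return torch.topk(scores, k).indices.tolist()

# Example: score 8 experts on a synthetic calibration batch, keep 50%.
scores = [reap_saliency(torch.rand(128), torch.randn(128, 64)) for _ in range(8)]
print("retained experts:", select_experts(scores, keep_ratio=0.5))
```

The intuition: an expert matters only if the router selects it with substantial weight and its output meaningfully moves the layer’s result; an expert failing either test can be removed with little effect.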
Across a diverse range of SMoE models, from 20 billion to 1 trillion parameters, REAP consistently outperformed both expert merging and other pruning methods on generative benchmarks. The advantage was particularly pronounced at a 50% compression ratio. Notably, REAP achieved near-lossless compression on demanding tasks like code generation and tool-calling, even after removing half of the experts from models such as Qwen3-Coder-480B and Kimi-K2.
Why Merging Falls Short for Generative AI
The paper backs the functional subspace collapse claim with empirical evidence. Visualizations of expert activations show that pruning preserves the geometric structure of the original expert manifold, albeit at reduced density. Merging, by contrast, causes a visible contraction toward the manifold’s center, especially in later layers where experts are more specialized. This contraction signifies a sharp reduction in functional diversity.
Further analysis revealed that merged models produced significantly less diverse n-gram outputs and exhibited higher perplexity than pruned models, indicating a divergence from the original model’s generation quality. The shortcomings of merging are attributed not only to the loss of router control but also to its non-local nature and the high cardinality of expert clusters, both of which make it difficult to combine expert parameters coherently. A simple diversity metric of this kind is sketched below.
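As one rough illustration (the paper’s exact diversity metric may differ), a distinct-n-gram ratio can be computed as follows:

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams in a generation that are unique.

    Lower values indicate more repetitive output, the symptom reported
    for merged models. (Metric details are our assumption.)
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(distinct_ngram_ratio("the cat sat on the mat the cat sat on the rug"))
```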
The research also highlights the critical importance of domain-specific calibration data for effective compression, particularly at higher compression ratios. Models calibrated with data relevant to the target domain showed significantly higher accuracy than those calibrated with general datasets.
In conclusion, while expert merging may suffice for discriminative tasks, it fundamentally impairs the auto-regressive generation capabilities that generative tasks require. By preserving the crucial coordination between the router and the experts, REAP offers a robust and scalable solution for compressing SMoE models, making them more efficient for real-world generative AI applications. The authors have open-sourced their code and select compressed model checkpoints, linked from the paper, to facilitate further research.