TLDR: A new research paper introduces REAP, a pruning method for Sparsely-activated Mixture-of-Experts (SMoE) models that significantly outperforms expert merging on generative tasks such as code generation and creative writing. The paper proves that merging introduces an “irreducible error” by destroying the router’s independent control over experts, leading to a “functional subspace collapse.” REAP, which weighs router gate-values together with expert activation norms, achieves near-lossless compression, making large SMoE models more memory-efficient without sacrificing performance on real-world generative applications.
Large Language Models (LLMs) are becoming increasingly powerful, with many leveraging a sophisticated architecture known as Sparsely-activated Mixture-of-Experts (SMoE). These models offer benefits like efficient pre-training and lower latency, but their massive parameter counts lead to significant memory demands. This challenge has spurred research into methods for compressing these experts without compromising performance.
Historically, expert compression efforts have explored quantization, low-rank compression, and two primary expert-reduction strategies: expert pruning and expert merging. Recent findings, particularly on discriminative benchmarks like perplexity and multiple-choice question answering, have often favored expert merging. However, a new research paper titled “REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression,” authored by Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, and Vithursan Thangarasa, challenges this notion, especially for generative tasks.
The paper argues that for generative tasks – such as code generation, creative writing, and mathematical reasoning – expert pruning is the superior strategy. At the core of their argument is a phenomenon they call “functional subspace collapse,” which expert merging induces. They prove that merging introduces an irreducible error because it removes the router’s independent, input-dependent control over individual experts. Once experts are merged, the model loses the ability to dynamically mix and modulate the outputs of the original, distinct experts based on the input, forcing it to approximate a dynamic target with a static one.
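To make this concrete, here is the intuition written out as a short derivation. The notation is ours, not the paper’s exact formulation:

```latex
% An SMoE layer mixes expert outputs with input-dependent gates:
y(x) = \sum_{i \in \mathrm{TopK}(x)} g_i(x)\, E_i(x)

% Merging experts $j$ and $k$ into a single expert $E_m$ collapses
% two independently gated terms into one:
\hat{y}(x) = \hat{g}(x)\, E_m(x)
  \quad \text{in place of} \quad
  g_j(x)\, E_j(x) + g_k(x)\, E_k(x)

% Because $E_m$ is fixed while the target mixture varies with $x$,
% the best achievable reconstruction error
\min_{E_m}\; \mathbb{E}_x \left\| g_j(x)\, E_j(x) + g_k(x)\, E_k(x)
  - \hat{g}(x)\, E_m(x) \right\|^2
% stays strictly positive whenever $g_j$ and $g_k$ vary independently
% across inputs: a static expert cannot track a dynamic target.
```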
In contrast, expert pruning, which involves removing entire experts, preserves the router’s independent control over the remaining experts. This distinction is crucial for maintaining the model’s functional output space and its ability to generate diverse and coherent responses.
Introducing REAP: A Novel Pruning Criterion
Leveraging this insight, the researchers propose Router-weighted Expert Activation Pruning (REAP). This novel pruning criterion considers both the router’s gate-values (which determine how much an expert is activated) and the expert activation norms (the magnitude of an expert’s output). By combining these factors, REAP identifies and prunes experts that contribute minimally to the layer’s output, ensuring that the most impactful experts are retained.
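As a rough illustration, here is a minimal sketch of how such a saliency score could be computed over a calibration set and used to select experts. The function names, tensor shapes, and exact averaging are our assumptions, not the paper’s reference implementation:

```python
import torch

def reap_saliency(gate_values: torch.Tensor, expert_outputs: torch.Tensor) -> float:
    """Score one expert from calibration-set activations.

    gate_values:    (T,) router gate for this expert on the T tokens
                    routed to it.
    expert_outputs: (T, d) the expert's outputs for those tokens.

    Saliency = mean over routed tokens of gate * ||output||_2, i.e. how
    much the expert actually contributes to the layer output. (The exact
    normalization here is our assumption; see the paper for details.)
    """
    contribution = gate_values * expert_outputs.norm(dim=-1)  # (T,)
    return contribution.mean().item()

def select_experts(saliency_scores: list[float], keep_ratio: float = 0.5) -> list[int]:
    """Return indices of the top-scoring experts to retain; the rest are pruned."""
    scores = torch.tensor(saliency_scores)
    k = max(1, int(len(scores) * keep_ratio))
    return torch.topk(scores, k).indices.tolist()

# Example: score 8 experts on a synthetic calibration batch, keep 50%.
scores = [reap_saliency(torch.rand(128), torch.randn(128, 64)) for _ in range(8)]
print("retained experts:", select_experts(scores, keep_ratio=0.5))
```

The intuition: an expert matters only if the router selects it with substantial weight and its output meaningfully moves the layer’s result; an expert failing either test can be removed with little effect.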
Across a diverse range of SMoE models, from 20 billion to 1 trillion parameters, REAP consistently outperformed both expert merging and other pruning methods on generative benchmarks. The advantage was particularly pronounced at a 50% compression ratio. Notably, REAP achieved near-lossless compression on demanding tasks like code generation and tool-calling, even after removing half of the experts from models such as Qwen3-Coder-480B and Kimi-K2.
Why Merging Falls Short for Generative AI
The paper backs the functional subspace collapse claim with empirical evidence. Visualizations of expert activations show that pruning preserves the geometric structure of the original expert manifold, albeit at reduced density. Merging, by contrast, causes a visible contraction toward the manifold’s center, especially in later layers where experts are more specialized. This contraction signifies a sharp reduction in functional diversity.
Further analysis revealed that merged models produced significantly less diverse n-gram outputs and exhibited higher perplexity than pruned models, indicating a divergence from the original model’s generation quality. The shortcomings of merging are attributed not only to the loss of router control but also to its non-local nature and the high cardinality of expert clusters, both of which make it difficult to combine expert parameters coherently. A simple diversity metric of this kind is sketched below.
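As one rough illustration (the paper’s exact diversity metric may differ), a distinct-n-gram ratio can be computed as follows:

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams in a generation that are unique.

    Lower values indicate more repetitive output, the symptom reported
    for merged models. (Metric details are our assumption.)
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(distinct_ngram_ratio("the cat sat on the mat the cat sat on the rug"))
```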
The research also highlights the critical importance of domain-specific calibration data for effective compression, particularly at higher compression ratios. Models calibrated with data relevant to the target domain showed significantly higher accuracy than those calibrated with general datasets.
In conclusion, while expert merging may suffice for discriminative tasks, it fundamentally impairs the auto-regressive generation capabilities that generative tasks require. By preserving the crucial coordination between the router and the experts, REAP offers a robust and scalable solution for compressing SMoE models, making them more efficient for real-world generative AI applications. The authors have open-sourced their code and select compressed model checkpoints, linked from the paper, to facilitate further research.