TL;DR: A new post-training strategy called Ban&Pick significantly improves the accuracy and inference speed of Mixture-of-Experts (MoE) Large Language Models (LLMs) without requiring retraining or architectural changes. The ‘Pick’ module identifies and reinforces highly influential ‘key experts’ to boost performance, while the ‘Ban’ module dynamically prunes redundant experts based on layer and token sensitivity to accelerate inference. Experiments on models like DeepSeek and Qwen3 demonstrate notable accuracy gains (e.g., 3.99% on AIME2024 for Qwen3-30B-A3B) and inference speedups (up to 1.27x) by making expert routing smarter.
Large Language Models (LLMs) are becoming increasingly powerful, but their sheer size often makes them challenging to run efficiently. A popular solution is the Mixture-of-Experts (MoE) architecture, which allows LLMs to scale by activating only a small subset of specialized ‘experts’ for each input. While fine-grained MoE designs, like those in DeepSeek and Qwen3, have introduced hundreds of experts per layer, a new study reveals that their full potential for performance and efficiency is often held back by how experts are chosen during pre-training.
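As background, a fine-grained MoE layer scores every expert with a learned gate and keeps only the top-k per token. The sketch below shows this standard routing step in NumPy; the names, shapes, and gate design are illustrative assumptions, not taken from any particular model.

```python
# Minimal sketch of standard top-k MoE routing with a softmax gate.
# All names and dimensions here are illustrative, not from the paper.
import numpy as np

def route_top_k(hidden, gate_weights, k=8):
    """Select the k highest-scoring experts for one token."""
    logits = hidden @ gate_weights           # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over all experts
    top = np.argsort(probs)[-k:][::-1]       # indices of the k best experts
    weights = probs[top] / probs[top].sum()  # renormalized gate weights
    return top, weights

rng = np.random.default_rng(0)
experts, weights = route_top_k(rng.standard_normal(64),
                               rng.standard_normal((64, 128)), k=8)
```

Ban&Pick operates entirely on this selection step, overriding which expert indices survive and how many, without touching any expert's weights.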
Researchers have identified two key issues in current MoE routing strategies. First, some highly influential experts are underutilized: routing decisions are made early in pre-training and are designed to balance load across all experts, so the most capable experts can be overlooked. Second, activating a fixed number of experts for every input token introduces significant redundancy, since many of the activated experts contribute little to the final output.
Instead of requiring complex retraining or architectural changes, a new post-training strategy called Ban&Pick has been introduced. This plug-and-play method aims to make MoE routing smarter, leading to better performance and faster inference. You can read the full research paper here: Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs.
The ‘Pick’ Module: Reinforcing Key Experts
The ‘Pick’ module focuses on identifying and enhancing the influence of ‘key experts.’ These are a small group of experts that have an outsized impact on the model’s performance. The researchers found that within the most frequently selected experts, only a tight subset truly makes a decisive difference. By analyzing how much the output distribution shifts when an expert is removed, and how much accuracy improves when an expert is forcibly included, ‘Pick’ can pinpoint these critical experts.
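The identification step can be sketched as an ablation loop, assuming access to a callable that returns the model's output distribution with one expert removed. Here `toy_probs` stands in for a real forward pass, and the KL-divergence score is an illustrative proxy for the paper's influence measure rather than its exact metric.

```python
# Hedged sketch: score each expert by how much the output distribution
# shifts when that expert is ablated, then keep the highest-impact ones.
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two output distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rank_key_experts(output_probs, num_experts, top_n=4):
    """Rank experts by the output shift their removal causes."""
    baseline = output_probs(banned=None)          # full-model distribution
    impact = [kl_divergence(baseline, output_probs(banned=e))
              for e in range(num_experts)]        # shift from removing e
    return sorted(range(num_experts),
                  key=lambda e: impact[e], reverse=True)[:top_n]

# Toy stand-in for a forward pass: experts contribute additively to logits.
rng = np.random.default_rng(1)
contrib = rng.random((16, 10))

def toy_probs(banned=None):
    mask = np.ones(16)
    if banned is not None:
        mask[banned] = 0.0                        # ablate one expert
    logits = mask @ contrib
    p = np.exp(logits - logits.max())
    return p / p.sum()

key_experts = rank_key_experts(toy_probs, num_experts=16)
```

The complementary signal described above, accuracy gains from forcibly including an expert, could be scored the same way by swapping the ablation for a forced activation.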
Once identified, ‘Pick’ reinforces these key experts during routing. Experiments showed that simply forcing these experts to be chosen, especially when they are already close to being selected by the router, significantly boosts accuracy across various domains like math, code, and general reasoning. For instance, on the Qwen3-30B-A3B model, applying ‘Pick’ alone led to an average 2.83% performance improvement.
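This reinforcement rule can be sketched as a small override on top of ordinary top-k routing: a key expert the router nearly selected is swapped in for the weakest chosen expert. The `margin` threshold and swap rule here are assumptions for illustration, not the paper's exact criterion.

```python
# Sketch of the 'Pick' idea: force-include near-miss key experts.
import numpy as np

def pick_route(probs, k, key_experts, margin=0.8):
    """Top-k routing that swaps in key experts close to the cutoff."""
    top = list(np.argsort(probs)[-k:][::-1])      # usual top-k, best first
    for e in key_experts:
        cutoff = probs[top[-1]]                   # weakest selected expert
        if e not in top and probs[e] >= margin * cutoff:
            top[-1] = e                           # swap in the key expert
            top.sort(key=lambda i: probs[i], reverse=True)
    weights = probs[top] / probs[top].sum()       # renormalize gate weights
    return top, weights

# Expert 3 is a key expert just below the top-3 cutoff, so it gets swapped in.
gate = np.array([0.30, 0.24, 0.20, 0.18, 0.05, 0.03])
selected, w = pick_route(gate, k=3, key_experts=[3])
```

Because only near-miss key experts are promoted, the override stays cheap and leaves routing unchanged for tokens where the key experts were never plausible candidates.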
The ‘Ban’ Module: Pruning Redundant Experts
Complementing ‘Pick’ is the ‘Ban’ module, which tackles the issue of redundant expert activation. It dynamically prunes experts that contribute little to the final output, thereby accelerating inference with minimal accuracy loss. ‘Ban’ achieves this by considering two factors: layer sensitivity and token sensitivity.
Layer sensitivity acknowledges that different layers in an MoE model respond differently to expert pruning; some layers are much more robust than others. Token sensitivity recognizes that during reasoning, some tokens concentrate their routing weights on a few experts (making them robust to pruning), while others distribute weights more evenly (making them more sensitive). By combining these insights, ‘Ban’ can intelligently reduce the number of active experts per token, leading to substantial speedups.
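A minimal sketch of such a pruning rule, assuming a per-layer cumulative-weight threshold: tokens whose gate weights are concentrated clear the threshold with few experts, while flatter (more sensitive) tokens keep the full top-k. The threshold values and the cumulative-coverage rule are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of 'Ban'-style dynamic pruning: keep the smallest prefix of the
# top-k experts whose cumulative gate weight reaches a per-layer threshold.
import numpy as np

def ban_prune(probs, k, layer_threshold=0.9):
    """Drop trailing top-k experts once the threshold is covered."""
    top = np.argsort(probs)[-k:][::-1]            # usual top-k, best first
    w = probs[top] / probs[top].sum()
    keep = int(np.searchsorted(np.cumsum(w), layer_threshold)) + 1
    kept = top[:keep]
    return kept, probs[kept] / probs[kept].sum()

concentrated = np.array([0.60, 0.20, 0.10, 0.05, 0.03, 0.02])  # robust token
flat = np.array([0.18, 0.17, 0.17, 0.16, 0.16, 0.16])          # sensitive token
kept_c, _ = ban_prune(concentrated, k=4)  # prunes to fewer experts
kept_f, _ = ban_prune(flat, k=4)          # keeps the full top-k
```

Layer sensitivity enters through `layer_threshold`: robust layers can use a lower threshold and prune more aggressively, while sensitive layers keep it high.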
For example, on Qwen3-30B-A3B, ‘Ban’ achieved a 1.25x inference speedup while limiting accuracy loss to within 1.5% on most datasets, and even within 1% on more than half of the results.
Combined Power: Ban&Pick
When ‘Ban’ and ‘Pick’ are combined, the unified framework delivers both accuracy gains and inference acceleration. On Qwen3-30B-A3B, the Ban&Pick strategy achieved an average accuracy improvement of 1.99% and a 1.25x speedup. For the larger Qwen3-235B-A22B, it yielded an average accuracy gain of 1.33% and a 1.26x speedup. These results demonstrate that by simply routing experts more intelligently, fine-grained MoE models can achieve stronger performance and faster inference without the need for costly retraining or architectural modifications.
The study also explored how key experts from different domains interact. While there can be some interference when enhancing multiple domains simultaneously, the combined enhancement still provides clear benefits over the baseline, confirming the robustness and crucial role of these key experts in MoE models. This research opens new avenues for optimizing MoE-LLMs, suggesting future work could focus on automatic key expert selection and more sophisticated enhancement strategies.