TL;DR: A new post-training strategy called Ban&Pick significantly improves the accuracy and inference speed of Mixture-of-Experts (MoE) Large Language Models (LLMs) without requiring retraining or architectural changes. The ‘Pick’ module identifies and reinforces highly influential ‘key experts’ to boost performance, while the ‘Ban’ module dynamically prunes redundant experts based on layer and token sensitivity to accelerate inference. Experiments on models like DeepSeek and Qwen3 demonstrate notable accuracy gains (e.g., 3.99% on AIME2024 for Qwen3-30B-A3B) and inference speedups (up to 1.27x) by making expert routing smarter.
Large Language Models (LLMs) are becoming increasingly powerful, but their sheer size often makes them challenging to run efficiently. A popular solution is the Mixture-of-Experts (MoE) architecture, which allows LLMs to scale by activating only a small subset of specialized ‘experts’ for each input. While fine-grained MoE designs, like those in DeepSeek and Qwen3, have introduced hundreds of experts per layer, a new study reveals that their full potential for performance and efficiency is often held back by how experts are chosen during pre-training.
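As background, a fine-grained MoE layer scores every expert with a learned gate and keeps only the top-k per token. The sketch below shows this standard routing step in NumPy; the names, shapes, and gate design are illustrative assumptions, not taken from any particular model.

```python
# Minimal sketch of standard top-k MoE routing with a softmax gate.
# All names and dimensions here are illustrative, not from the paper.
import numpy as np

def route_top_k(hidden, gate_weights, k=8):
    """Select the k highest-scoring experts for one token."""
    logits = hidden @ gate_weights           # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over all experts
    top = np.argsort(probs)[-k:][::-1]       # indices of the k best experts
    weights = probs[top] / probs[top].sum()  # renormalized gate weights
    return top, weights

rng = np.random.default_rng(0)
experts, weights = route_top_k(rng.standard_normal(64),
                               rng.standard_normal((64, 128)), k=8)
```

Ban&Pick operates entirely on this selection step, overriding which expert indices survive and how many, without touching any expert's weights.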
Researchers have identified two key issues in current MoE routing strategies. First, some highly influential experts are underutilized: routing decisions are made early in pre-training and are designed to balance load across all experts, so the most capable experts can be overlooked. Second, activating a fixed number of experts for every input token introduces significant redundancy, since many of the activated experts contribute little to the final output.
Instead of requiring complex retraining or architectural changes, a new post-training strategy called Ban&Pick has been introduced. This plug-and-play method aims to make MoE routing smarter, leading to better performance and faster inference. You can read the full research paper here: Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs.
The ‘Pick’ Module: Reinforcing Key Experts
The ‘Pick’ module focuses on identifying and enhancing the influence of ‘key experts.’ These are a small group of experts that have an outsized impact on the model’s performance. The researchers found that within the most frequently selected experts, only a tight subset truly makes a decisive difference. By analyzing how much the output distribution shifts when an expert is removed, and how much accuracy improves when an expert is forcibly included, ‘Pick’ can pinpoint these critical experts.
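The identification step can be sketched as an ablation loop, assuming access to a callable that returns the model's output distribution with one expert removed. Here `toy_probs` stands in for a real forward pass, and the KL-divergence score is an illustrative proxy for the paper's influence measure rather than its exact metric.

```python
# Hedged sketch: score each expert by how much the output distribution
# shifts when that expert is ablated, then keep the highest-impact ones.
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two output distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rank_key_experts(output_probs, num_experts, top_n=4):
    """Rank experts by the output shift their removal causes."""
    baseline = output_probs(banned=None)          # full-model distribution
    impact = [kl_divergence(baseline, output_probs(banned=e))
              for e in range(num_experts)]        # shift from removing e
    return sorted(range(num_experts),
                  key=lambda e: impact[e], reverse=True)[:top_n]

# Toy stand-in for a forward pass: experts contribute additively to logits.
rng = np.random.default_rng(1)
contrib = rng.random((16, 10))

def toy_probs(banned=None):
    mask = np.ones(16)
    if banned is not None:
        mask[banned] = 0.0                        # ablate one expert
    logits = mask @ contrib
    p = np.exp(logits - logits.max())
    return p / p.sum()

key_experts = rank_key_experts(toy_probs, num_experts=16)
```

The complementary signal described above, accuracy gains from forcibly including an expert, could be scored the same way by swapping the ablation for a forced activation.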
Once identified, ‘Pick’ reinforces these key experts during routing. Experiments showed that simply forcing these experts to be chosen, especially when they are already close to being selected by the router, significantly boosts accuracy across various domains like math, code, and general reasoning. For instance, on the Qwen3-30B-A3B model, applying ‘Pick’ alone led to an average 2.83% performance improvement.
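This reinforcement rule can be sketched as a small override on top of ordinary top-k routing: a key expert the router nearly selected is swapped in for the weakest chosen expert. The `margin` threshold and swap rule here are assumptions for illustration, not the paper's exact criterion.

```python
# Sketch of the 'Pick' idea: force-include near-miss key experts.
import numpy as np

def pick_route(probs, k, key_experts, margin=0.8):
    """Top-k routing that swaps in key experts close to the cutoff."""
    top = list(np.argsort(probs)[-k:][::-1])      # usual top-k, best first
    for e in key_experts:
        cutoff = probs[top[-1]]                   # weakest selected expert
        if e not in top and probs[e] >= margin * cutoff:
            top[-1] = e                           # swap in the key expert
            top.sort(key=lambda i: probs[i], reverse=True)
    weights = probs[top] / probs[top].sum()       # renormalize gate weights
    return top, weights

# Expert 3 is a key expert just below the top-3 cutoff, so it gets swapped in.
gate = np.array([0.30, 0.24, 0.20, 0.18, 0.05, 0.03])
selected, w = pick_route(gate, k=3, key_experts=[3])
```

Because only near-miss key experts are promoted, the override stays cheap and leaves routing unchanged for tokens where the key experts were never plausible candidates.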
The ‘Ban’ Module: Pruning Redundant Experts
Complementing ‘Pick’ is the ‘Ban’ module, which tackles the issue of redundant expert activation. It dynamically prunes experts that contribute little to the final output, thereby accelerating inference with minimal accuracy loss. ‘Ban’ achieves this by considering two factors: layer sensitivity and token sensitivity.
Layer sensitivity acknowledges that different layers in an MoE model respond differently to expert pruning; some layers are much more robust than others. Token sensitivity recognizes that during reasoning, some tokens concentrate their routing weights on a few experts (making them robust to pruning), while others distribute weights more evenly (making them more sensitive). By combining these insights, ‘Ban’ can intelligently reduce the number of active experts per token, leading to substantial speedups.
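A minimal sketch of such a pruning rule, assuming a per-layer cumulative-weight threshold: tokens whose gate weights are concentrated clear the threshold with few experts, while flatter (more sensitive) tokens keep the full top-k. The threshold values and the cumulative-coverage rule are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of 'Ban'-style dynamic pruning: keep the smallest prefix of the
# top-k experts whose cumulative gate weight reaches a per-layer threshold.
import numpy as np

def ban_prune(probs, k, layer_threshold=0.9):
    """Drop trailing top-k experts once the threshold is covered."""
    top = np.argsort(probs)[-k:][::-1]            # usual top-k, best first
    w = probs[top] / probs[top].sum()
    keep = int(np.searchsorted(np.cumsum(w), layer_threshold)) + 1
    kept = top[:keep]
    return kept, probs[kept] / probs[kept].sum()

concentrated = np.array([0.60, 0.20, 0.10, 0.05, 0.03, 0.02])  # robust token
flat = np.array([0.18, 0.17, 0.17, 0.16, 0.16, 0.16])          # sensitive token
kept_c, _ = ban_prune(concentrated, k=4)  # prunes to fewer experts
kept_f, _ = ban_prune(flat, k=4)          # keeps the full top-k
```

Layer sensitivity enters through `layer_threshold`: robust layers can use a lower threshold and prune more aggressively, while sensitive layers keep it high.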
For example, on Qwen3-30B-A3B, ‘Ban’ achieved a 1.25x inference speedup while limiting accuracy loss to within 1.5% on most datasets, and even within 1% on more than half of the results.
Combined Power: Ban&Pick
When ‘Ban’ and ‘Pick’ are combined, the unified framework delivers both accuracy gains and inference acceleration. On Qwen3-30B-A3B, the Ban&Pick strategy achieved an average accuracy improvement of 1.99% and a 1.25x speedup. For the larger Qwen3-235B-A22B, it yielded an average accuracy gain of 1.33% and a 1.26x speedup. These results demonstrate that by simply routing experts more intelligently, fine-grained MoE models can achieve stronger performance and faster inference without the need for costly retraining or architectural modifications.
The study also explored how key experts from different domains interact. While there can be some interference when enhancing multiple domains simultaneously, the combined enhancement still provides clear benefits over the baseline, confirming the robustness and crucial role of these key experts in MoE models. This research opens new avenues for optimizing MoE-LLMs, suggesting future work could focus on automatic key expert selection and more sophisticated enhancement strategies.