TLDR: PuzzleMoE is a new, training-free method for compressing large Mixture-of-Experts (MoE) AI models. It uses sparse expert merging to preserve both shared and specialized knowledge, and a bit-packed encoding scheme to embed masks and signs directly into weights for efficient, metadata-free inference. This allows up to 50% compression with minimal accuracy loss, significant inference speedup, and reduced memory footprint, making large MoE models more deployable.
Large language models (LLMs) have become remarkably capable, and a key architectural innovation behind their scale is the Mixture-of-Experts (MoE) design. MoE models activate only a small subset of their “experts” for each input, which keeps the computation per token low. However, every expert’s parameters must still be held in memory, and this footprint has been a significant hurdle to wider deployment, especially for larger models such as Mixtral-8x7B or DeepSeek-MoE.
To tackle this memory challenge, researchers have explored various compression techniques. Earlier methods often involved “expert dropping,” where less important experts were removed entirely, or “expert merging,” which combined similar experts. While these approaches offered some relief, they frequently led to a noticeable drop in the model’s performance, sometimes discarding crucial knowledge in the process. The core issue is that experts contain a mix of shared, general knowledge and specialized knowledge unique to specific tasks or patterns. Existing methods struggled to preserve both simultaneously.
Introducing PuzzleMoE: A Smart Compression Solution
A new method called PuzzleMoE offers a promising solution to this problem. It’s a training-free compression technique for MoE models that aims for both high accuracy and efficient operation. PuzzleMoE stands out with two main innovations:
- Sparse Expert Merging: Instead of merging entire experts, PuzzleMoE works at the level of individual weights, identifying which entries are redundant across experts and which are specialized. It uses a “dual-mask” system: one mask marks knowledge shared between experts, and the other marks the unique, critical parameters of each expert. This lets it merge only the redundant parts while preserving each expert’s specialized capabilities.
- Bit-Packed Inference: Storing the binary masks and signs needed for this fine-grained merging could introduce overhead of its own. PuzzleMoE avoids this with a “bit-packed encoding scheme” that reuses underutilized bits in the standard bfloat16 data format (commonly used for LLM inference) to embed the masks and signs directly into the weight tensors. This removes the need for separate metadata storage and, combined with a custom GPU kernel, enables fast and memory-efficient inference.
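The two ideas above can be illustrated with a small sketch. This is not the paper's algorithm: the similarity rule, the threshold `TAU`, the choice to keep the larger-magnitude weight where experts disagree, and the 16-bit layout (4 flag bits, 5 re-biased exponent bits, 7 mantissa bits, with `EXP_BASE` as the assumed lowest exponent) are all hypothetical choices made for illustration. The sketch only shows how one merged magnitude plus per-expert mask and sign bits can fit in a single 16-bit word with no side metadata.

```python
import struct

TAU = 0.5       # hypothetical similarity threshold (not from the paper)
EXP_BASE = 96   # hypothetical: lowest biased exponent kept, i.e. 2**-31
EXP_BITS = 5    # exponent bits kept after re-biasing; frees 3 bits + sign bit

def merge_pair(w_a, w_b, tau=TAU):
    """Merge two experts' weights into (magnitude, mask_a, mask_b, neg_a, neg_b)."""
    merged = []
    for a, b in zip(w_a, w_b):
        mag_a, mag_b = abs(a), abs(b)
        larger = max(mag_a, mag_b)
        if abs(mag_a - mag_b) <= tau * larger:
            # Similar magnitudes: share their average between both experts.
            entry = ((mag_a + mag_b) / 2, True, True)
        else:
            # Dissimilar: keep only the larger weight, zero it for the other expert.
            entry = (larger, mag_a > mag_b, mag_b > mag_a)
        merged.append(entry + (a < 0, b < 0))
    return merged

def bf16_bits(x):
    """Upper 16 bits of the float32 encoding of x (bfloat16 by truncation)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def pack(magnitude, mask_a, mask_b, neg_a, neg_b):
    """Pack a merged magnitude and four flags into one 16-bit word."""
    bits = bf16_bits(abs(magnitude))
    exp5 = ((bits >> 7) & 0xFF) - EXP_BASE
    if not 0 <= exp5 < (1 << EXP_BITS):
        raise ValueError("magnitude outside the assumed exponent window")
    flags = (int(mask_a) << 3) | (int(mask_b) << 2) | (int(neg_a) << 1) | int(neg_b)
    return (flags << 12) | (exp5 << 7) | (bits & 0x7F)

def unpack(word, expert):
    """Recover one expert's weight from a packed word (expert 0 = a, 1 = b)."""
    flags = word >> 12
    if not (flags >> (3 - expert)) & 1:
        return 0.0  # this position is masked out for the requested expert
    negative = (flags >> (1 - expert)) & 1
    bits = ((((word >> 7) & 0x1F) + EXP_BASE) << 7) | (word & 0x7F)
    value = struct.unpack(">f", struct.pack(">I", bits << 16))[0]
    return -value if negative else value

# Toy usage: two "experts" with three weights each.
w_a = [0.75, -0.5, 0.125]
w_b = [1.25, 0.5, -1.5]
words = [pack(*entry) for entry in merge_pair(w_a, w_b)]
print([unpack(w, 0) for w in words])  # [1.0, -0.5, 0.0]
print([unpack(w, 1) for w in words])  # [1.0, 0.5, -1.5]
```

In this toy layout the two experts share one 16-bit word per weight position, which is where the 50% storage saving comes from. The reconstruction is lossy: similar weights collapse to their average, and the weaker of two dissimilar weights is dropped. Magnitudes outside the assumed exponent window, including exact zeros, are rejected rather than handled.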
Impressive Results and Efficiency
Extensive experiments have shown PuzzleMoE’s effectiveness across various MoE models, including Mixtral-8x7B, DeepSeek-MoE, and Qwen-MoE. It can compress these models by up to 50% while largely maintaining their accuracy across a range of tasks. For instance, at a 50% compression ratio, PuzzleMoE outperformed previous MoE compression methods by up to 16.7% on the MMLU benchmark and achieved up to 1.28 times faster inference. The compression process itself is also efficient, taking only minutes, whereas other methods take hours or are infeasible altogether.
Furthermore, PuzzleMoE significantly reduces memory usage, allowing large models like Mixtral-8x7B, which might typically require two high-end GPUs, to run on a single one after compression. This makes advanced AI models more accessible for deployment in environments with limited resources. The design is also “task-agnostic,” meaning it doesn’t require specific calibration datasets for different tasks, simplifying its practical application.
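A back-of-the-envelope calculation shows why halving the expert weights can move Mixtral-8x7B from two GPUs to one. The figures below are assumptions not taken from the article: the layer sizes and the 46.7B parameter total come from the publicly released Mixtral-8x7B configuration, the 80 GB GPU capacity is an illustrative choice, and the 50% ratio is assumed to apply to expert weights only. The estimate covers weights alone and ignores the KV cache and activations.

```python
BYTES_PER_BF16 = 2

# Assumed from the public Mixtral-8x7B configuration (not stated in the article).
TOTAL_PARAMS = 46.7e9
LAYERS, EXPERTS = 32, 8
HIDDEN, FFN = 4096, 14336
MATRICES_PER_EXPERT = 3  # gate, up, and down projections

GPU_BYTES = 80e9  # assumed capacity of one high-end GPU

expert_params = LAYERS * EXPERTS * MATRICES_PER_EXPERT * HIDDEN * FFN
other_params = TOTAL_PARAMS - expert_params  # attention, embeddings, routers

def weight_bytes(expert_keep_ratio):
    """Weight memory in bytes when only a fraction of expert storage is kept."""
    kept = other_params + expert_keep_ratio * expert_params
    return kept * BYTES_PER_BF16

for ratio in (1.0, 0.5):
    gb = weight_bytes(ratio) / 1e9
    fits = "fits" if weight_bytes(ratio) <= GPU_BYTES else "does not fit"
    print(f"expert storage kept: {ratio:.0%} -> {gb:.1f} GB, {fits} on one GPU")
```

Under these assumptions the uncompressed weights need roughly 93 GB, while the model with expert storage halved needs roughly 48 GB, which leaves headroom on a single 80 GB device. Experts account for about 45 of the 46.7 billion parameters, which is why compressing only the experts recovers almost half the total footprint.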
The research paper, available at arxiv.org/pdf/2511.04805, details these innovations and results. PuzzleMoE represents a significant step forward in making large Mixture-of-Experts models more compact and efficient without sacrificing their powerful performance, paving the way for broader real-world applications.


