TLDR: Merge-of-Thought (MoT) Distillation is a new framework for training smaller language models to learn complex reasoning from multiple ‘teacher’ AI models. Instead of relying on a single teacher, which is often not the best choice, MoT alternates between fine-tuning student branches on each teacher’s reasoning style and merging those branches back into one student. This process distills a ‘consensus’ reasoning, overcoming conflicts and noise from diverse teachers. MoT significantly improves performance on math benchmarks with minimal data, mitigates forgetting, enhances general reasoning, and is robust to varied teacher qualities, offering an efficient way to transfer advanced reasoning to smaller, deployable models.
Large language models (LLMs) increasingly demonstrate impressive reasoning capabilities, often through what’s known as Chain-of-Thought (CoT) processes. However, transferring these complex reasoning skills efficiently to smaller, more deployable models has been a significant challenge. Traditional methods often rely on a single ‘oracle’ teacher model, but new research suggests this approach might be limiting.
A recent paper titled “Merge-of-Thought Distillation” by Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, and Junbo Zhao introduces a novel framework designed to overcome these limitations. The researchers observed that there isn’t a universal ‘best teacher’ for all student models or even for the same student across different datasets. This insight led them to propose a method that unifies the reasoning abilities of multiple teachers, rather than painstakingly selecting just one.
Introducing Merge-of-Thought (MoT) Distillation
The core of their innovation is Merge-of-Thought Distillation (MoT), a lightweight framework that efficiently distills long CoT capabilities from diverse teachers into compact student models. MoT operates through an iterative process involving two main steps:
- Teacher-Specific Branch Supervised Fine-Tuning (SFT): In this step, the student model is branched, and each branch is fine-tuned on the reasoning rationales provided by a specific teacher. This allows the student to internalize the unique reasoning style of each individual teacher.
- Weight-Space Merging: After each branch has been trained, the parameters (weights) of these student variants are merged, typically by simple averaging. This crucial step distills a consensus, retaining reasoning features that are consistently reinforced across multiple teachers while smoothing out individual quirks or noise from any single teacher.
This alternating process of training and merging allows the student model to progressively condense the multi-teacher consensus reasoning, leading to a more robust and capable model.
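The alternation described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: weights are plain dictionaries of floats, `toy_branch_sft` is a hypothetical stand-in for real supervised fine-tuning (it simply moves each weight part-way toward a teacher-specific target), and `merge_by_averaging` uses the uniform average that the article names as the typical merge rule. All function and parameter names here are invented for illustration.

```python
from typing import Callable, Dict, List

# Toy representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_by_averaging(branches: List[Weights]) -> Weights:
    """Uniformly average the parameters of several student branches."""
    if not branches:
        raise ValueError("need at least one branch to merge")
    merged: Weights = {}
    for name in branches[0]:
        # Group the i-th value of this parameter across all branches.
        columns = zip(*(branch[name] for branch in branches))
        merged[name] = [sum(col) / len(branches) for col in columns]
    return merged


def toy_branch_sft(student: Weights, teacher_target: Weights,
                   step: float = 0.5) -> Weights:
    """Hypothetical stand-in for SFT on one teacher's rationales:
    move every weight part-way toward a teacher-specific target."""
    return {
        name: [w + step * (t - w) for w, t in zip(values, teacher_target[name])]
        for name, values in student.items()
    }


def mot_rounds(student: Weights, teacher_targets: List[Weights], rounds: int,
               sft: Callable[[Weights, Weights], Weights] = toy_branch_sft) -> Weights:
    """Alternate branch training and merging for a fixed number of rounds."""
    for _ in range(rounds):
        # Step 1: branch the current student once per teacher and train each copy.
        branches = [sft(student, target) for target in teacher_targets]
        # Step 2: merge the branches; the result seeds the next round.
        student = merge_by_averaging(branches)
    return student


if __name__ == "__main__":
    start = {"w": [0.0, 0.0]}
    teachers = [{"w": [1.0, 2.0]}, {"w": [3.0, 2.0]}]
    print(mot_rounds(start, teachers, rounds=2))  # -> {'w': [1.5, 1.5]}
```

In this toy setting the two teachers agree on the second coordinate and disagree on the first, and the merged student drifts toward the point both can support (here, the mean of the targets). With a real model, the same loop would operate on the model's parameter tensors, and each branch would be trained with an actual fine-tuning run on that teacher's rationales.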
Key Findings and Advantages
The research highlights several compelling advantages of the MoT framework:
- Superior Performance: Using only about 200 high-quality CoT samples, MoT applied to a Qwen3-14B student model surpassed the performance of strong models like DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1 on competition math benchmarks.
- Outperforms Single and Naive Multi-Teacher Methods: MoT consistently outperformed both the best single-teacher distillation and a naive approach of simply combining all multi-teacher data.
- Mitigates Catastrophic Forgetting: The framework helps reduce catastrophic forgetting, a common issue where models forget previously learned information when acquiring new skills. It also improves general reasoning beyond mathematics.
- Robustness: MoT demonstrated robustness to teachers with distribution-shifted reasoning styles and even peer-level teachers, meaning it can extract beneficial signals even from less-than-perfect or equally strong teachers.
- Cultivates Better Teachers: The MoT-merged student models, when used as teachers themselves, provided stronger distillation signals to new students, indicating that the consensus-filtered reasoning features transfer broadly and lead to higher-quality CoT.
- Smoother Training Landscape: Analysis showed that MoT trains in a ‘flatter’ region of the loss landscape, leading to more stable and robust performance compared to other methods.
The authors emphasize that MoT offers a simple and scalable route for efficiently distilling long Chain-of-Thought capabilities from diverse teacher models into more compact and efficient student models. This work marks a significant step towards making advanced AI reasoning more accessible and deployable. For more details, you can refer to the original research paper: Merge-of-Thought Distillation.


