LongCat-Flash: Meituan’s 560 Billion Parameter Model Sets New Standards for Efficiency and Agentic AI

TLDR: LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model from Meituan, designed for high computational efficiency and advanced agentic capabilities. It features dynamic computation allocation via “Zero-computation Experts” and improved inference efficiency with “Shortcut-connected MoE.” Trained on over 20 trillion tokens in 30 days, it achieves over 100 tokens per second inference at $0.70 per million output tokens, demonstrating competitive performance in general and exceptional strength in agentic tasks. The model checkpoint is open-sourced.

A recent technical report introduces LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model developed by the Meituan LongCat Team. The model is designed to excel in both computational efficiency and advanced agentic capabilities, addressing the growing need for AI that is scalable yet powerful.

LongCat-Flash stands out with two primary architectural innovations. First, it features “Zero-computation Experts.” This clever design allows the model to dynamically allocate its computational budget. Instead of activating a fixed number of parameters for every token, LongCat-Flash activates between 18.6 billion and 31.3 billion parameters (averaging 27 billion) per token, depending on the complexity and contextual demands. This means that simpler parts of a text require less processing power, while more challenging parts receive more, optimizing resource usage significantly. The system even includes a PID controller to ensure a consistent average computational load, preventing under- or over-utilization of its “zero-computation” experts, which simply pass the input through without additional cost.
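To make the routing idea concrete, here is a minimal toy sketch, not the report's implementation. It assumes a top-k router that scores a pool of ordinary experts plus a few identity ("zero-computation") experts, and a PID-style controller that nudges a shared bias on the zero experts until the average number of real experts per token reaches a target. The sizes (`N_REAL`, `N_ZERO`, `TOP_K`), the target, the controller gains, and the use of random Gaussian scores in place of a learned router are all hypothetical choices for illustration.

```python
import random

# Hypothetical sizes for illustration only; these are not LongCat-Flash's values.
N_REAL, N_ZERO, TOP_K = 8, 4, 4
TARGET_REAL = 3.0  # desired average number of real (costly) experts per token


def zero_expert(x):
    """A zero-computation expert: returns its input unchanged at no extra cost."""
    return x


def top_k_experts(scores, zero_bias, k=TOP_K):
    """Pick the k highest-scoring experts; zero experts (index >= N_REAL) get a bias."""
    biased = [s + (zero_bias if i >= N_REAL else 0.0) for i, s in enumerate(scores)]
    return sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]


def real_expert_count(chosen):
    """Number of selected experts that actually perform computation."""
    return sum(1 for i in chosen if i < N_REAL)


def run(steps=200, batch=256, kp=0.1, ki=0.5, kd=0.0, seed=0):
    """Simulate routing and adjust the zero-expert bias with a PID-style update."""
    rng = random.Random(seed)
    zero_bias, integral, prev_err = 0.0, 0.0, 0.0
    history = []
    for _ in range(steps):
        counts = []
        for _ in range(batch):
            # Stand-in for learned router logits (assumption: i.i.d. Gaussian scores).
            scores = [rng.gauss(0.0, 1.0) for _ in range(N_REAL + N_ZERO)]
            counts.append(real_expert_count(top_k_experts(scores, zero_bias)))
        avg_real = sum(counts) / batch
        err = avg_real - TARGET_REAL  # positive means too much compute is in use
        integral += err
        # A larger bias makes zero experts more attractive, lowering real compute.
        zero_bias = kp * err + ki * integral + kd * (err - prev_err)
        prev_err = err
        history.append((avg_real, min(counts), max(counts)))
    return zero_bias, history
```

Calling `run()` returns the final bias and a per-step history of `(average, minimum, maximum)` real experts per token. In this toy setup the average settles near the target while individual tokens still receive different amounts of computation, which is the behaviour the paragraph above describes.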

The second key innovation is the “Shortcut-connected MoE” (ScMoE). In large MoE models, the communication between different “experts” can become a bottleneck, slowing down both training and inference. ScMoE tackles this by reordering the execution pipeline, significantly enlarging the window where computation and communication can happen simultaneously. This design has shown remarkable gains in inference efficiency and throughput without compromising the model’s quality, a crucial factor for scaling up such massive models.
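The benefit of overlapping communication with computation can be illustrated with a simple timing model. This is a back-of-the-envelope sketch under stated assumptions, not a description of how ScMoE is scheduled: it assumes each layer has a dense computation phase, an expert communication phase, and an expert computation phase, and that the communication can be fully hidden behind the dense computation when the two run concurrently. The function names and the millisecond figures in the usage lines are hypothetical.

```python
def serial_layer_time(dense_ms, comm_ms, expert_ms):
    """Every phase runs back to back, so communication time is fully exposed."""
    return dense_ms + comm_ms + expert_ms


def overlapped_layer_time(dense_ms, comm_ms, expert_ms):
    """Communication runs alongside dense compute; only the longer one is exposed."""
    return max(dense_ms, comm_ms) + expert_ms


# Hypothetical per-layer timings in milliseconds.
print(serial_layer_time(4, 3, 5))      # 4 + 3 + 5 = 12
print(overlapped_layer_time(4, 3, 5))  # max(4, 3) + 5 = 9
```

In this model the saving is `min(dense_ms, comm_ms)` per layer, so a wider overlap window matters most when communication time is comparable to the computation it can hide behind.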

Training a model of LongCat-Flash’s scale (560 billion parameters) is a monumental task. The Meituan LongCat Team developed a comprehensive scaling framework to ensure stable and reproducible training. This framework includes hyperparameter transfer, which uses insights from smaller models to predict optimal settings for the larger one, and a “model-growth initialization” strategy, where the model starts from a half-scale version pre-trained on trillions of tokens. This approach leads to faster convergence and better performance compared to traditional random initialization. Furthermore, a multi-pronged stability suite, including router-gradient balancing and a “hidden z-loss” to prevent massive activations, ensures the training process remains robust and free from irrecoverable loss spikes. The team also implemented deterministic computation, guaranteeing exact reproducibility and aiding in the detection of silent data corruption.
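The article does not give the formula for the "hidden z-loss," so the following is only one plausible form, modeled on the router z-loss used in earlier MoE work: a squared log-sum-exp penalty, applied here to the absolute values of each token's hidden activations. The function names and the coefficient `coeff` are assumptions, and the report's actual definition may differ.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def hidden_z_loss(hidden_states, coeff=1e-4):
    """Assumed form: coeff * mean over tokens of logsumexp(|h|) squared.

    hidden_states is a list of per-token activation vectors (lists of floats).
    A single very large activation dominates the log-sum-exp, so it is
    penalized roughly in proportion to its squared magnitude.
    """
    penalties = [logsumexp([abs(h) for h in token]) ** 2 for token in hidden_states]
    return coeff * sum(penalties) / len(penalties)


# Hypothetical activations: one well-behaved token, one with a massive activation.
normal = [[0.1, -0.2, 0.3, 0.0]]
spiky = [[0.1, -0.2, 300.0, 0.0]]
print(hidden_z_loss(normal) < hidden_z_loss(spiky))  # True
```

Because the penalty grows with the largest activation, adding it to the training loss with a small coefficient discourages outlier activations while barely affecting typical ones.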

To cultivate LongCat-Flash’s agentic intelligence, the model underwent a multi-stage training pipeline. This began with large-scale pre-training on optimized data mixtures, followed by targeted mid- and post-training phases focusing on reasoning, code, and instructions. The process was further augmented with synthetic data and tool-use tasks. A unique multi-agent synthesis framework was designed to generate high-quality, challenging tasks by controlling complexity across information processing, tool-set usage, and user interaction. This meticulous approach ensures LongCat-Flash can perform complex tasks requiring iterative reasoning and interaction with various environments.

The results are impressive. LongCat-Flash completed its pre-training on over 20 trillion tokens within 30 days, achieving over 100 tokens per second (TPS) for inference at a cost of just $0.70 per million output tokens on H800 GPUs. This demonstrates exceptional efficiency for a model of its size. In comprehensive evaluations, LongCat-Flash delivers highly competitive performance among other leading models, with particular strengths in agentic tasks. For instance, it scored 86.5 on ArenaHard-V2, 39.5 on TerminalBench, and 67.7 on τ2-Bench, showcasing its robust capabilities in general domains, coding, and agentic tool use.
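The headline figures are easier to appreciate with a little arithmetic. The sketch below uses only the numbers quoted above (20 trillion tokens, 30 days, 100 tokens per second, $0.70 per million output tokens), treated as round values even though the article states them as bounds. The function names and the 250,000-token example are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def pretraining_throughput(tokens, days):
    """Average tokens processed per second over the whole training run."""
    return tokens / (days * SECONDS_PER_DAY)


def seconds_per_million_tokens(tokens_per_second):
    """Time for a single generation stream to emit one million tokens."""
    return 1_000_000 / tokens_per_second


def output_cost_usd(tokens, usd_per_million):
    """Cost of generating the given number of output tokens."""
    return tokens / 1_000_000 * usd_per_million


print(round(pretraining_throughput(20e12, 30)))  # 7716049 tokens per second
print(seconds_per_million_tokens(100))           # 10000.0 seconds, about 2.8 hours
print(round(output_cost_usd(250_000, 0.70), 3))  # 0.175 dollars
```

Put differently, processing 20 trillion tokens in 30 days implies an aggregate training throughput of roughly 7.7 million tokens per second, and a single stream decoding at 100 tokens per second would take just under three hours to produce the million tokens that cost $0.70.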

The Meituan LongCat Team has open-sourced the model checkpoint of LongCat-Flash to foster community research and innovation. This move is expected to accelerate advancements in efficient MoE architectures, data strategies, and agentic model development. For more technical details, you can refer to the original research paper: LongCat-Flash Technical Report.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]