LongCat-Flash: Meituan's 560 Billion Parameter Model Sets New Standards for Efficiency and Agentic AI

TLDR: LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model from Meituan, designed for high computational efficiency and advanced agentic capabilities. It allocates computation dynamically via “Zero-computation Experts” and improves inference efficiency with “Shortcut-connected MoE.” Pre-trained on over 20 trillion tokens within 30 days, it achieves over 100 tokens per second at inference for $0.70 per million output tokens, with competitive general performance and particular strength in agentic tasks. The model checkpoint is open-sourced.

A recent technical report introduces LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model developed by the Meituan LongCat Team. The model is designed to excel in both computational efficiency and advanced agentic capabilities, addressing the growing need for AI that is powerful yet scalable.

LongCat-Flash stands out with two primary architectural innovations. The first is “Zero-computation Experts,” which simply pass their input through at no additional cost. With these experts available, the model allocates its computational budget dynamically: instead of activating a fixed number of parameters for every token, LongCat-Flash activates between 18.6 billion and 31.3 billion parameters (averaging 27 billion) per token, depending on the token’s complexity and contextual demands. Simpler parts of a text therefore receive less processing, while more challenging parts receive more. A PID controller keeps the average computational load consistent, preventing under- or over-utilization of the zero-computation experts.
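To make the idea concrete, here is a minimal toy sketch of top-k routing over a mix of regular and zero-computation experts, with a PID controller steering the average number of regular experts per token toward a target. It is not the report's implementation: the bias added to zero-computation experts, the controller gains, the random router scores, and every name and size below (`route_token`, `PIDController`, `simulate`, 16 regular and 8 zero-computation experts, top-6 routing) are assumptions chosen for illustration.

```python
import random


def top_k(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def route_token(scores, n_ffn, k, zero_bias):
    """Select k experts for one token.

    Indices below n_ffn are regular FFN experts; the rest are
    zero-computation (identity) experts. Returns how many regular
    experts were selected, a proxy for the compute spent on the token.
    """
    biased = [s + (zero_bias if i >= n_ffn else 0.0) for i, s in enumerate(scores)]
    return sum(1 for i in top_k(biased, k) if i < n_ffn)


class PIDController:
    """Textbook positional PID controller (gains are illustrative)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=150, tokens=256, n_ffn=16, n_zero=8, k=6, target=3.0, seed=0):
    """Run the control loop on random router scores (hypothetical setup)."""
    rng = random.Random(seed)
    pid = PIDController(kp=0.1, ki=0.1, kd=0.0)
    zero_bias = 0.0
    history = []
    for _ in range(steps):
        used = [
            route_token([rng.gauss(0, 1) for _ in range(n_ffn + n_zero)],
                        n_ffn, k, zero_bias)
            for _ in range(tokens)
        ]
        avg = sum(used) / tokens
        history.append(avg)
        # Too many regular experts (avg > target) raises the bias that
        # favors zero-computation experts; too few lowers it.
        zero_bias = pid.update(avg - target)
    tail = history[-50:]
    return sum(tail) / len(tail), zero_bias


if __name__ == "__main__":
    avg_ffn, bias = simulate()
    print(f"average regular experts per token: {avg_ffn:.2f}, zero-expert bias: {bias:.2f}")
```

In this toy setup an unbiased router would pick about four regular experts per token (6 × 16/24), so the controller has to raise the bias on the zero-computation experts to bring the average down to the target of three. Individual tokens still vary in how many regular experts they receive, which is the point of the design.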

The second key innovation is the “Shortcut-connected MoE” (ScMoE). In large MoE models, the communication between different “experts” can become a bottleneck, slowing down both training and inference. ScMoE tackles this by reordering the execution pipeline, significantly enlarging the window where computation and communication can happen simultaneously. This design has shown remarkable gains in inference efficiency and throughput without compromising the model’s quality, a crucial factor for scaling up such massive models.
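The benefit of overlapping communication with computation can be illustrated with a toy latency model. This is only a back-of-the-envelope sketch: the three timings are made-up numbers, and the assumption that expert-dispatch communication can run fully in parallel with one block of dense computation is a simplification, not a description of the actual ScMoE schedule.

```python
def serial_latency(t_comm, t_dense, t_expert):
    """Communication, dense compute, and expert compute run back to back."""
    return t_comm + t_dense + t_expert


def overlapped_latency(t_comm, t_dense, t_expert):
    """Communication is hidden behind dense compute (or vice versa)."""
    return max(t_comm, t_dense) + t_expert


# Hypothetical per-layer timings in milliseconds, for illustration only.
t_comm, t_dense, t_expert = 0.8, 0.6, 1.0

serial = serial_latency(t_comm, t_dense, t_expert)
overlapped = overlapped_latency(t_comm, t_dense, t_expert)
saving = 1 - overlapped / serial

print(f"serial: {serial:.1f} ms, overlapped: {overlapped:.1f} ms, saving: {saving:.0%}")
```

Under these assumed numbers the overlapped schedule is limited by the slower of the two concurrent steps, so the larger the window in which both can run together, the more communication time is hidden.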

Training a model of LongCat-Flash’s scale (560 billion parameters) is a monumental task. The Meituan LongCat Team developed a comprehensive scaling framework to ensure stable and reproducible training. This framework includes hyperparameter transfer, which uses insights from smaller models to predict optimal settings for the larger one, and a “model-growth initialization” strategy, where the model starts from a half-scale version pre-trained on trillions of tokens. This approach leads to faster convergence and better performance compared to traditional random initialization. Furthermore, a multi-pronged stability suite, including router-gradient balancing and a “hidden z-loss” to prevent massive activations, ensures the training process remains robust and free from irrecoverable loss spikes. The team also implemented deterministic computation, guaranteeing exact reproducibility and aiding in the detection of silent data corruption.
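As a rough illustration of how a z-loss-style penalty can discourage massive activations, the sketch below applies the squared log-sum-exp form familiar from logit z-loss to the absolute values of each token's hidden state. Whether LongCat-Flash's “hidden z-loss” takes exactly this form is an assumption; the function names and the coefficient are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def hidden_z_loss(hidden_states, coeff=1e-4):
    """Mean over tokens of coeff * logsumexp(|h|)**2.

    hidden_states is a list of per-token hidden vectors (lists of floats).
    The form and the coefficient are assumptions for illustration.
    """
    total = 0.0
    for h in hidden_states:
        total += logsumexp([abs(x) for x in h]) ** 2
    return coeff * total / len(hidden_states)


# A token with one very large activation is penalized far more heavily.
normal = [[0.5, -0.3, 0.1, 0.2]]
spiky = [[0.5, -0.3, 0.1, 200.0]]
print(hidden_z_loss(normal), hidden_z_loss(spiky))
```

Because log-sum-exp is dominated by the largest entry, the penalty grows roughly with the square of the largest activation magnitude, which is why a single outlier dominates the loss.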

To cultivate LongCat-Flash’s agentic intelligence, the model underwent a multi-stage training pipeline. This began with large-scale pre-training on optimized data mixtures, followed by targeted mid- and post-training phases focusing on reasoning, code, and instructions. The process was further augmented with synthetic data and tool-use tasks. A unique multi-agent synthesis framework was designed to generate high-quality, challenging tasks by controlling complexity across information processing, tool-set usage, and user interaction. This meticulous approach ensures LongCat-Flash can perform complex tasks requiring iterative reasoning and interaction with various environments.

The results are impressive. LongCat-Flash completed its pre-training on over 20 trillion tokens within 30 days, and it achieves over 100 tokens per second (TPS) at inference for just $0.70 per million output tokens on H800 GPUs, exceptional efficiency for a model of its size. In comprehensive evaluations, LongCat-Flash performs competitively with other leading models and is particularly strong in agentic tasks. For instance, it scored 86.5 on ArenaHard-V2, 39.5 on TerminalBench, and 67.7 on τ2-Bench, showcasing its capabilities in general domains, coding, and agentic tool use.
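The headline figures can be put in perspective with some simple arithmetic. The sketch below takes the reported values as exact (the article gives them as "over" 20 trillion tokens and "over" 100 TPS, so the derived numbers are approximate bounds) and assumes the 100 TPS figure describes a single output stream.

```python
SECONDS_PER_DAY = 86_400

# Pre-training: 20 trillion tokens in 30 days.
pretrain_tokens = 20e12
pretrain_days = 30
train_tokens_per_second = pretrain_tokens / (pretrain_days * SECONDS_PER_DAY)

# Inference: 100 tokens per second at $0.70 per million output tokens.
decode_tps = 100
price_per_million = 0.70
hours_per_million = 1_000_000 / decode_tps / 3600
cost_per_stream_hour = price_per_million / hours_per_million

print(f"average training throughput: {train_tokens_per_second:,.0f} tokens/s")
print(f"time to emit 1M tokens on one stream: {hours_per_million:.2f} h")
print(f"cost of one stream running for an hour: ${cost_per_stream_hour:.3f}")
```

In other words, the training run averaged roughly 7.7 million tokens per second across the cluster, and a single stream decoding at 100 TPS needs just under three hours to produce a million tokens, costing about $0.25 per hour of continuous generation.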

The Meituan LongCat Team has open-sourced the model checkpoint of LongCat-Flash to foster community research and innovation. This move is expected to accelerate advancements in efficient MoE architectures, data strategies, and agentic model development. For more technical details, you can refer to the original research paper: LongCat-Flash Technical Report.
