Optimizing LLM Inference with Coordinated Autoscaling

TLDR: HeteroScale is an autoscaling framework for large language model (LLM) inference that targets Prefill-Decode (P/D) disaggregated architectures. Traditional autoscalers struggle with heterogeneous hardware, network bottlenecks, and imbalance between the two stages. HeteroScale uses a topology-aware scheduler, network-aware abstractions, and a metric-driven policy (primarily decode Tokens-Per-Second) to coordinate the scaling of the prefill and decode stages. Deployed at ByteDance on tens of thousands of GPUs, it raised average GPU utilization by 26.6 percentage points and saved hundreds of thousands of GPU-hours daily while maintaining service quality.

Large Language Models (LLMs) are at the forefront of many modern applications, from chatbots to search engines. However, serving these powerful models efficiently is a significant challenge, primarily due to their intensive demand for Graphics Processing Units (GPUs). Traditional autoscaling systems, like those used in Kubernetes, often fall short when dealing with the unique requirements of modern LLM serving architectures, especially those that separate the ‘prefill’ and ‘decode’ phases of inference.

The Prefill-Decode (P/D) disaggregated architecture, while offering powerful optimization opportunities, introduces several complex operational hurdles. These include inefficient use of diverse hardware, network congestion, and critical imbalances between the prefill (processing the input prompt) and decode (generating tokens one by one) stages. Addressing these issues is crucial for maintaining performance and controlling costs in large-scale LLM deployments.

Introducing HeteroScale: A Coordinated Autoscaling Framework

Researchers from ByteDance Seed and the National University of Singapore have introduced HeteroScale, a coordinated autoscaling framework designed to tackle these core challenges in P/D disaggregated LLM serving. HeteroScale combines a topology-aware scheduler with a metric-driven policy, derived from extensive production data, to manage resources efficiently.

The framework’s key innovations include:

Heterogeneous Resource Management: HeteroScale treats the P/D ratio and specific hardware requirements as primary scheduling constraints. Its scheduler intelligently places different service roles (prefill or decode) on the most suitable hardware, considering network connections and maintaining the crucial balance between prefill and decode stages.
Network-Aware Scheduling: To minimize latency during the transfer of large Key-Value (KV) caches between prefill and decode instances, HeteroScale uses abstractions like ‘Deployment Groups’ and ‘RDMA Subgroups’. These ensure that related instances are placed close together within the network, optimizing the use of high-performance hardware.
Data-Driven Scaling Policies: After a comprehensive analysis of autoscaling signals from massive production datasets, the HeteroScale team identified ‘decode Tokens-Per-Second (TPS)’ as the most reliable metric. Conventional hardware metrics such as GPU utilization can be misleading for memory-bound decode stages, whereas decode TPS provides a robust signal for scaling the prefill and decode pools jointly, which keeps the two stages in balance.

How HeteroScale Works

HeteroScale operates through a layered architecture, including an autoscaling layer with a policy engine, a federated pre-scheduling layer, and a sub-cluster scheduling layer. The policy engine uses both periodic and metrics-driven strategies. While periodic scaling handles predictable traffic patterns, the metrics-driven policy, primarily using decode TPS, provides fine-grained, real-time adjustments. This is crucial because metrics like GPU utilization can be deceptive for decode nodes, which often show high utilization due to memory pressure even at low workloads.
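The metrics-driven idea can be illustrated with a minimal sketch. It is not HeteroScale's actual policy: it assumes a simple proportional rule in which the decode pool is sized from the observed decode TPS and the prefill pool follows through a fixed P/D ratio. The names `desired_pool_sizes`, `target_tps_per_decode`, and `pd_ratio` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolSizes:
    prefill: int
    decode: int


def desired_pool_sizes(
    observed_decode_tps: float,
    target_tps_per_decode: float,  # assumed per-instance TPS target
    pd_ratio: float,               # assumed prefill instances per decode instance
    min_decode: int = 1,
) -> PoolSizes:
    """Size both pools from a single signal: aggregate decode TPS."""
    if target_tps_per_decode <= 0 or pd_ratio <= 0:
        raise ValueError("target_tps_per_decode and pd_ratio must be positive")
    # Enough decode instances that each runs at or below its target TPS.
    decode = max(min_decode, math.ceil(observed_decode_tps / target_tps_per_decode))
    # Prefill is derived from decode, so the two pools scale together.
    prefill = max(1, math.ceil(decode * pd_ratio))
    return PoolSizes(prefill=prefill, decode=decode)


sizes = desired_pool_sizes(12_500, target_tps_per_decode=1_000, pd_ratio=0.5)
print(sizes)
```

Because prefill is computed from the decode count rather than from its own metric, a single scaling decision moves both pools and the P/D ratio is preserved by construction.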

The federated pre-scheduling layer translates scaling decisions into actual resource placements. It manages heterogeneous GPU resources, using ‘Deployment Groups’ to ensure network affinity and ‘RDMA Subgroups’ to prioritize different hardware pools. This ensures that high-value, high-performance resources are reserved for the workloads that need them most. The system also actively maintains the optimal P/D ratio, which can vary significantly based on workload characteristics, to prevent bottlenecks.
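As a rough illustration of this placement logic, the sketch below walks candidate RDMA subgroups in priority order and picks the first one that can host the prefill and decode instances together, so that KV cache transfers stay inside one network domain. The data model (`RdmaSubgroup`, its `priority` field, and the free-GPU counters) and the first-fit rule are assumptions made for the example, not the scheduler described in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RdmaSubgroup:
    name: str
    priority: int            # hypothetical: lower value is tried first
    free_prefill_gpus: int   # free GPUs suited to prefill
    free_decode_gpus: int    # free GPUs suited to decode


def place_together(
    subgroups: List[RdmaSubgroup], prefill_gpus: int, decode_gpus: int
) -> Optional[str]:
    """Return the subgroup that hosts both roles, or None if nothing fits."""
    for sg in sorted(subgroups, key=lambda s: s.priority):
        fits = (
            sg.free_prefill_gpus >= prefill_gpus
            and sg.free_decode_gpus >= decode_gpus
        )
        if fits:
            # Reserve capacity for both roles in the same subgroup.
            sg.free_prefill_gpus -= prefill_gpus
            sg.free_decode_gpus -= decode_gpus
            return sg.name
    return None


pools = [
    RdmaSubgroup("premium", priority=2, free_prefill_gpus=64, free_decode_gpus=64),
    RdmaSubgroup("standard", priority=1, free_prefill_gpus=8, free_decode_gpus=16),
]
print(place_together(pools, prefill_gpus=8, decode_gpus=16))
```

Trying the lower-priority-value pool first means the scarcer "premium" pool is only consumed once the "standard" pool is full, which is one simple way to keep high-value hardware in reserve.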

To ensure stability, HeteroScale incorporates anti-flapping mechanisms (cooling periods, hysteresis thresholds, dampening factors) and disaster recovery measures like ‘soft scaling in’. Soft scaling in allows instances to be withdrawn from service but kept running, ready to be reinstated if performance degrades, avoiding costly startup delays.
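Two of these mechanisms, hysteresis thresholds and cooling periods, can be sketched in a few lines. The thresholds, the 300-second cooldown, and the `StableScaler` class are illustrative assumptions rather than values or names taken from HeteroScale.

```python
from dataclasses import dataclass


@dataclass
class StableScaler:
    scale_out_above: float = 1.10  # assumed: observed load / target load
    scale_in_below: float = 0.80   # gap between thresholds is the hysteresis band
    cooldown_s: float = 300.0      # assumed cooling period after any action
    last_action_at: float = float("-inf")

    def decide(self, load_ratio: float, now: float) -> str:
        """Return 'scale_out', 'scale_in', or 'hold' for one observation."""
        # Cooling period: ignore the signal shortly after the last action.
        if now - self.last_action_at < self.cooldown_s:
            return "hold"
        if load_ratio > self.scale_out_above:
            self.last_action_at = now
            return "scale_out"
        if load_ratio < self.scale_in_below:
            self.last_action_at = now
            return "scale_in"
        # Inside the hysteresis band: small fluctuations cause no action.
        return "hold"


scaler = StableScaler()
for t, ratio in [(0, 1.2), (100, 0.5), (400, 0.5), (1000, 0.9)]:
    print(t, ratio, scaler.decide(ratio, now=t))
```

The separate scale-out and scale-in thresholds prevent a load that hovers near the target from triggering alternating actions, and the cooldown gives each change time to take effect before the next decision.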

Proven Impact in Production

HeteroScale has been successfully deployed in ByteDance’s massive production environment, managing tens of thousands of GPUs. This real-world application has demonstrated significant benefits, including a 26.6 percentage point increase in average GPU utilization and daily savings of hundreds of thousands of GPU-hours. Crucially, these efficiency gains were achieved while consistently meeting stringent service level objectives (SLOs).

The TPS-based policy, which manages 64% of the GPU fleet under HeteroScale, proved more efficient than the periodic policy, delivering 10.0 percentage points higher GPU utilization. This highlights the importance of dynamic, real-time adjustments to workload fluctuations.

For more in-depth technical details, you can refer to the original research paper: Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference.

HeteroScale sets a new benchmark for robust, efficient, and scalable LLM serving platforms, addressing critical challenges in large-scale AI infrastructure and paving the way for future advancements in resource management for evolving LLM services.
