TLDR: HeteroScale is a new autoscaling framework designed for large language model (LLM) inference, specifically addressing challenges in Prefill-Decode (P/D) disaggregated architectures. Traditional autoscalers fail to manage heterogeneous hardware, network bottlenecks, and architectural imbalances. HeteroScale uses a topology-aware scheduler, novel network-aware abstractions, and a metric-driven policy (primarily decode Tokens-Per-Second) to coordinate scaling of prefill and decode stages. Deployed at ByteDance on tens of thousands of GPUs, it raised average GPU utilization by 26.6 percentage points and saved hundreds of thousands of GPU-hours daily while maintaining service quality.
Large Language Models (LLMs) are at the forefront of many modern applications, from chatbots to search engines. However, serving these powerful models efficiently is a significant challenge, primarily due to their intensive demand for Graphics Processing Units (GPUs). Traditional autoscaling systems, like those used in Kubernetes, often fall short when dealing with the unique requirements of modern LLM serving architectures, especially those that separate the ‘prefill’ and ‘decode’ phases of inference.
The Prefill-Decode (P/D) disaggregated architecture, while offering powerful optimization opportunities, introduces several complex operational hurdles. These include inefficient use of diverse hardware, network congestion, and critical imbalances between the prefill (processing the input prompt) and decode (generating tokens one by one) stages. Addressing these issues is crucial for maintaining performance and controlling costs in large-scale LLM deployments.
Introducing HeteroScale: A Coordinated Autoscaling Framework
Researchers from ByteDance Seed and the National University of Singapore have introduced HeteroScale, a coordinated autoscaling framework designed to tackle these core challenges in P/D disaggregated LLM serving. HeteroScale combines a topology-aware scheduler with a new metric-driven policy, derived from extensive production data, to manage resources efficiently.
The framework’s key innovations include:
- Heterogeneous Resource Management: HeteroScale treats the P/D ratio and specific hardware requirements as primary scheduling constraints. Its scheduler intelligently places different service roles (prefill or decode) on the most suitable hardware, considering network connections and maintaining the crucial balance between prefill and decode stages.
- Network-Aware Scheduling: To minimize latency during the transfer of large Key-Value (KV) caches between prefill and decode instances, HeteroScale uses abstractions like ‘Deployment Groups’ and ‘RDMA Subgroups’. These ensure that related instances are placed close together within the network, optimizing the use of high-performance hardware.
- Data-Driven Scaling Policies: After a comprehensive analysis of autoscaling signals from massive production datasets, the researchers identified ‘decode Tokens-Per-Second (TPS)’ as the most reliable metric. Unlike conventional hardware metrics (like GPU utilization), which can be misleading for memory-bound decode stages, decode TPS provides a robust signal to jointly scale both prefill and decode pools, ensuring architectural balance.
How HeteroScale Works
HeteroScale operates through a layered architecture, including an autoscaling layer with a policy engine, a federated pre-scheduling layer, and a sub-cluster scheduling layer. The policy engine uses both periodic and metrics-driven strategies. While periodic scaling handles predictable traffic patterns, the metrics-driven policy, primarily using decode TPS, provides fine-grained, real-time adjustments. This is crucial because metrics like GPU utilization can be deceptive for decode nodes, which often show high utilization due to memory pressure even at low workloads.
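To make the metrics-driven idea concrete, here is a minimal sketch of a decode-TPS policy that sizes both pools together. It is an illustration under stated assumptions, not HeteroScale's actual implementation: the function name, the per-instance TPS target, and the convention that `pd_ratio` means "prefill instances per decode instance" are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    decode_instances: int
    prefill_instances: int


def target_instances(observed_decode_tps: float,
                     target_tps_per_decode_instance: float,
                     pd_ratio: float,
                     min_decode: int = 1) -> ScalingDecision:
    """Size the decode pool from decode TPS, then derive the prefill pool.

    Assumptions (not from the source): pd_ratio is prefill instances per
    decode instance, and each decode instance has a fixed target TPS.
    """
    if target_tps_per_decode_instance <= 0 or pd_ratio <= 0:
        raise ValueError("target TPS and P/D ratio must be positive")

    # Decode pool: enough instances to keep per-instance TPS at the target.
    decode = max(min_decode,
                 math.ceil(observed_decode_tps / target_tps_per_decode_instance))

    # Prefill pool: follows the decode pool so the P/D ratio stays fixed.
    prefill = max(1, math.ceil(decode * pd_ratio))
    return ScalingDecision(decode_instances=decode, prefill_instances=prefill)


if __name__ == "__main__":
    print(target_instances(9000.0, 1000.0, pd_ratio=0.5))
```

The point of the sketch is that a single signal drives both pools, so prefill and decode cannot drift apart the way they could if each pool were scaled on its own hardware metric.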
The federated pre-scheduling layer translates scaling decisions into actual resource placements. It manages heterogeneous GPU resources, using ‘Deployment Groups’ to ensure network affinity and ‘RDMA Subgroups’ to prioritize different hardware pools. This ensures that high-value, high-performance resources are reserved for the workloads that need them most. The system also actively maintains the optimal P/D ratio, which can vary significantly based on workload characteristics, to prevent bottlenecks.
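The placement step can be sketched roughly as follows. This is a simplified illustration, assuming that a prefill/decode pair must fit entirely inside one RDMA subgroup and that subgroups are tried in priority order so that scarcer pools are touched last. The class names, the `priority` field, and the GPU type labels are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RdmaSubgroup:
    name: str
    priority: int  # hypothetical convention: lower value is tried first
    free_gpus: Dict[str, int] = field(default_factory=dict)


@dataclass
class Request:
    prefill_gpu_type: str
    prefill_gpus: int
    decode_gpu_type: str
    decode_gpus: int


def demand(req: Request) -> Dict[str, int]:
    """Total GPUs needed per hardware type for one prefill/decode pair."""
    need: Dict[str, int] = {}
    need[req.prefill_gpu_type] = need.get(req.prefill_gpu_type, 0) + req.prefill_gpus
    need[req.decode_gpu_type] = need.get(req.decode_gpu_type, 0) + req.decode_gpus
    return need


def place(subgroups: List[RdmaSubgroup], req: Request) -> Optional[str]:
    """Place both roles in the first subgroup (by priority) that fits them.

    Keeping the pair in one subgroup models network affinity for KV-cache
    transfer; returns the subgroup name, or None if nothing fits.
    """
    need = demand(req)
    for sub in sorted(subgroups, key=lambda s: s.priority):
        if all(sub.free_gpus.get(t, 0) >= n for t, n in need.items()):
            for t, n in need.items():
                sub.free_gpus[t] -= n  # reserve the capacity
            return sub.name
    return None


if __name__ == "__main__":
    pools = [
        RdmaSubgroup("mixed", priority=0, free_gpus={"A": 4, "B": 4}),
        RdmaSubgroup("premium", priority=1, free_gpus={"A": 16}),
    ]
    print(place(pools, Request("A", 2, "B", 4)))
```

A real scheduler would also weigh the P/D ratio and topology distance; the sketch only shows the all-or-nothing, priority-ordered placement idea.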
To ensure stability, HeteroScale incorporates anti-flapping mechanisms (cooling periods, hysteresis thresholds, dampening factors) and disaster recovery measures like ‘soft scaling in’. Soft scaling in allows instances to be withdrawn from service but kept running, ready to be reinstated if performance degrades, avoiding costly startup delays.
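The three anti-flapping mechanisms named above can be combined in a small gate that sits between the policy's desired instance count and the count actually applied. The sketch below is illustrative only: the class name, the threshold values, the cooldown length, and the dampening factor are assumptions chosen for the example, not values reported for HeteroScale.

```python
import math
from dataclasses import dataclass


@dataclass
class AntiFlapGate:
    scale_out_threshold: float = 1.10  # act only if desired/current exceeds this
    scale_in_threshold: float = 0.80   # act only if desired/current falls below this
    cooldown_s: float = 300.0          # minimum seconds between two changes
    dampening: float = 0.5             # fraction of the gap closed per change
    last_change_ts: float = float("-inf")

    def decide(self, current: int, desired: int, now: float) -> int:
        """Return the instance count to apply at time `now` (in seconds)."""
        if current <= 0:
            raise ValueError("current must be positive")

        # Cooling period: ignore new signals right after a change.
        if now - self.last_change_ts < self.cooldown_s:
            return current

        # Hysteresis: small deviations inside the band trigger nothing.
        ratio = desired / current
        if self.scale_in_threshold <= ratio <= self.scale_out_threshold:
            return current

        # Dampening: close only part of the gap, rounding away from zero
        # so that an accepted change moves by at least one instance.
        step = (desired - current) * self.dampening
        delta = math.ceil(step) if step > 0 else math.floor(step)
        self.last_change_ts = now
        return current + delta


if __name__ == "__main__":
    gate = AntiFlapGate()
    print(gate.decide(current=10, desired=20, now=0.0))
```

Soft scaling in would be a separate step on top of this gate: instances removed by a scale-in decision are taken out of rotation but left running until the system is confident they are no longer needed.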
Proven Impact in Production
HeteroScale has been successfully deployed in ByteDance’s massive production environment, managing tens of thousands of GPUs. This real-world application has demonstrated significant benefits, including a 26.6 percentage point increase in average GPU utilization and daily savings of hundreds of thousands of GPU-hours. Crucially, these efficiency gains were achieved while consistently meeting stringent service level objectives (SLOs).
The TPS-based policy, which manages 64% of the GPU fleet under HeteroScale, proved more efficient than the periodic policy, delivering 10.0 percentage points higher GPU utilization. This highlights the importance of dynamic, real-time adjustments to workload fluctuations.
For more in-depth technical details, refer to the original research paper: Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference.
HeteroScale sets a new benchmark for robust, efficient, and scalable LLM serving platforms, addressing critical challenges in large-scale AI infrastructure and paving the way for future advancements in resource management for evolving LLM services.


