TLDR: Block is a novel distributed scheduling framework for Large Language Model (LLM) serving. It uses a predictive approach, combining response length estimation and inference simulation, to make intelligent scheduling and auto-provisioning decisions. This allows Block to significantly improve load balancing, boost serving capacity by up to 16.7%, and reduce P99 tail latency by up to 49.5% compared to traditional heuristic schedulers, ensuring more efficient and scalable LLM inference.
Large Language Models (LLMs) like GPT-4 and Llama have transformed various applications, from chatbots to code generation. However, serving these powerful models efficiently presents significant challenges, particularly in managing the unpredictable nature of their inference processes. Traditional LLM serving systems often struggle with load balancing and resource allocation, leading to performance bottlenecks and inconsistent user experiences.
A new distributed scheduling framework called Block aims to tackle these issues head-on. Developed by Wei Da and Evangelia Kalyvianaki from the University of Cambridge, Block introduces a predictive scheduling system that leverages contextual information from incoming requests to optimize load balancing and auto-provisioning across LLM instances. Unlike many existing systems that rely on simple, heuristic-based schedulers, Block operates as a fully distributed, stateless, and predictive system, designed for low overhead, high reliability, and scalability.
The Challenge of LLM Serving
The core difficulty in serving LLMs stems from their autoregressive nature, where tokens are generated sequentially. This leads to highly variable response lengths and decoding steps. Techniques like continuous batching and paged attention have improved efficiency, but they also introduce dynamic memory consumption and the possibility of request preemption, where ongoing requests are temporarily halted due to insufficient memory. These uncertainties make it hard for conventional schedulers, which often use basic rules like round-robin, to accurately manage the workload and prevent performance degradation.
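A toy comparison makes the round-robin problem concrete. The sketch below is illustrative only and is not taken from Block: it treats each request's cost as its response length in tokens, and it assumes those lengths are known at dispatch time, which a real scheduler cannot know without a prediction. The function names and the sample lengths are hypothetical.

```python
from typing import List


def round_robin(lengths: List[int], n_instances: int) -> List[int]:
    """Assign requests in rotation, ignoring how long each response is."""
    loads = [0] * n_instances
    for i, length in enumerate(lengths):
        loads[i % n_instances] += length
    return loads


def least_loaded(lengths: List[int], n_instances: int) -> List[int]:
    """Assign each request to the instance with the fewest tokens so far.

    Assumes the response length is known up front (an idealization).
    """
    loads = [0] * n_instances
    for length in lengths:
        target = loads.index(min(loads))
        loads[target] += length
    return loads


# Hypothetical workload: long and short responses alternate.
lengths = [900, 10, 800, 20, 700, 30]
print(round_robin(lengths, 2))   # [2400, 60]
print(least_loaded(lengths, 2))  # [930, 1530]
```

With alternating long and short responses, rotation sends every long request to the same instance, while the length-aware policy spreads the tokens far more evenly.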
Furthermore, auto-provisioning – the dynamic scaling of resources to meet demand – is also complicated. When new instances are added, existing heavily loaded instances might continue to process requests, leading to load imbalances and resource wastage, a phenomenon similar to ‘cold starts’ in serverless computing.
How Block Works
Block addresses these challenges by integrating two key insights: the ability to accurately predict response lengths and the feasibility of simulating LLM inference performance. The framework consists of four main services:
- Query Length Tagger: This is the entry point for requests. It uses a lightweight LLM-based regression model to estimate the anticipated length of the response based on the input prompt. This prediction is crucial for proactive scheduling.
- Predictor: Running as a sidecar service on each LLM instance, the Predictor simulates and forecasts key performance metrics, such as end-to-end latency or Time-To-First-Token (TTFT), for incoming requests. It adapts existing simulation frameworks like Vidur to provide real-time predictions.
- Global Scheduler: This service is fully distributed and stateless, ensuring scalability. Instead of maintaining a centralized view of instance statuses, it queries the Predictor services on each instance to get real-time metrics and predictions. Based on these predictions, it dispatches requests to the most appropriate model instances to balance the load.
- Inference Framework Backend: This is where the actual LLM execution and response generation happen, typically using frameworks like vLLM. Block is designed to be agnostic to the specific backend framework.
By estimating response lengths and simulating performance, Block can anticipate the resource demands and execution duration of requests, allowing it to make informed scheduling decisions. This predictive approach helps avoid situations where unexpectedly long responses overload a host or block subsequent requests.
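The dispatch flow described above can be sketched roughly as follows. This is a minimal illustration, not Block's implementation: the class and function names are hypothetical, and the linear per-token cost model is a stand-in assumption for the simulation-based Predictor (which the article says adapts frameworks like Vidur). The predicted response length would come from the length tagger; here it is simply a field on the request.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    prompt_tokens: int
    predicted_response_tokens: int  # assumed to be filled in by the length tagger


@dataclass
class InstancePredictor:
    """Toy stand-in for a per-instance Predictor sidecar (hypothetical)."""
    name: str
    queued_tokens: int = 0
    seconds_per_token: float = 0.02  # assumed constant cost, for illustration only

    def predict_latency(self, req: Request) -> float:
        # Assumed linear model: queued work plus this request's own tokens.
        total = self.queued_tokens + req.prompt_tokens + req.predicted_response_tokens
        return total * self.seconds_per_token

    def admit(self, req: Request) -> None:
        self.queued_tokens += req.prompt_tokens + req.predicted_response_tokens


def dispatch(req: Request, predictors: Sequence[InstancePredictor]) -> str:
    """Ask every predictor for a forecast and send the request to the lowest one.

    The scheduler holds no instance state of its own between calls.
    """
    best = min(predictors, key=lambda p: p.predict_latency(req))
    best.admit(req)
    return best.name


instances = [InstancePredictor("a", queued_tokens=500), InstancePredictor("b")]
print(dispatch(Request(100, 900), instances))  # b
print(dispatch(Request(50, 50), instances))    # a
```

Because the long request lands on the idle instance, the short request that follows is routed to the other one rather than queuing behind it.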
Performance and Impact
Evaluations conducted on a 12-GPU cluster using real-world datasets and the LLaMA2-7B model demonstrated Block’s superior performance. Block consistently outperformed widely-used heuristic schedulers across various metrics:
- It boosted serving capacity by up to 16.7%.
- It increased throughput by up to 4.4%.
- It significantly reduced average request latency by 19.9-45.8% and P99 tail latency (the latency that 99% of requests stay at or below) by 12.6-49.5%.
- Notably, Time-To-First-Token (TTFT) saw even more dramatic reductions, with average TTFT decreasing by 88.1-97.0% and P99 TTFT by 78.6-94.5%.
Block also proved effective in auto-provisioning. By using predicted metrics to preemptively add instances when needed, it achieved smoother changes in cluster size, higher GPU utilization, and a 20.1% reduction in P99 latency compared to reactive provisioning strategies. The framework’s generality was also confirmed, showing consistent advantages even when backend configurations, models (like Qwen2-7B), or datasets (like BurstGPT) were varied.
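One way to picture prediction-driven provisioning is a rule that scales up as soon as no instance is forecast to meet a latency target, rather than waiting for observed latency to degrade. The sketch below is an assumption made for illustration; the article does not spell out Block's actual provisioning rule, and the function name, threshold, and step size are hypothetical.

```python
from typing import Sequence


def instances_to_add(
    predicted_latencies: Sequence[float],
    target_seconds: float,
    step: int = 1,
) -> int:
    """Return how many instances to add given per-instance latency forecasts.

    Assumed rule: scale up only when even the best instance is predicted
    to miss the target for the incoming request.
    """
    if not predicted_latencies:
        return step  # no instances running yet
    if min(predicted_latencies) > target_seconds:
        return step
    return 0


print(instances_to_add([5.0, 7.0], target_seconds=4.0))  # 1
print(instances_to_add([3.0, 7.0], target_seconds=4.0))  # 0
```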
The research paper, titled “Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling,” highlights how predictive scheduling can lead to more efficient, responsive, and scalable LLM serving systems. The code and data for Block are open-sourced, paving the way for future advancements in this critical area.


