Optimizing LLM Serving with Predictive Scheduling: Introducing Block

TLDR: Block is a novel distributed scheduling framework for Large Language Model (LLM) serving. It uses a predictive approach, combining response length estimation and inference simulation, to make intelligent scheduling and auto-provisioning decisions. This allows Block to significantly improve load balancing, boost serving capacity by up to 16.7%, and reduce P99 tail latency by up to 49.5% compared to traditional heuristic schedulers, ensuring more efficient and scalable LLM inference.

Large Language Models (LLMs) like GPT-4 and Llama have transformed various applications, from chatbots to code generation. However, serving these powerful models efficiently presents significant challenges, particularly in managing the unpredictable nature of their inference processes. Traditional LLM serving systems often struggle with load balancing and resource allocation, leading to performance bottlenecks and inconsistent user experiences.

A new distributed scheduling framework called Block aims to tackle these issues head-on. Developed by Wei Da and Evangelia Kalyvianaki from the University of Cambridge, Block introduces a predictive scheduling system that leverages contextual information from incoming requests to optimize load balancing and auto-provisioning across LLM instances. Unlike many existing systems that rely on simple, heuristic-based schedulers, Block operates as a fully distributed, stateless, and predictive system, designed for low overhead, high reliability, and scalability.

The Challenge of LLM Serving

The core difficulty in serving LLMs stems from their autoregressive nature, where tokens are generated sequentially. This leads to highly variable response lengths and decoding steps. Techniques like continuous batching and paged attention have improved efficiency, but they also introduce dynamic memory consumption and the possibility of request preemption, where ongoing requests are temporarily halted due to insufficient memory. These uncertainties make it hard for conventional schedulers, which often use basic rules like round-robin, to accurately manage the workload and prevent performance degradation.
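To make the preemption risk concrete, here is a minimal sketch of how block-based KV-cache memory grows during decoding. It is an illustration only, not vLLM's allocator: the block size, the function names, and the one-token-per-step growth model are all assumptions made for this example.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for illustration)


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of fixed-size cache blocks required to hold num_tokens."""
    return math.ceil(num_tokens / block_size)


def decode_until_preemption(seq_lengths, total_blocks, max_steps):
    """Grow every running sequence by one token per decoding step.

    Returns the first step at which the batch no longer fits in the
    cache (the point where some request would have to be preempted),
    or None if everything fits for max_steps steps.
    """
    lengths = list(seq_lengths)
    for step in range(1, max_steps + 1):
        lengths = [n + 1 for n in lengths]
        if sum(blocks_needed(n) for n in lengths) > total_blocks:
            return step
    return None


# Two 30-token sequences fit in 4 blocks at first, but not for long.
tight = decode_until_preemption([30, 30], total_blocks=4, max_steps=10)
roomy = decode_until_preemption([30, 30], total_blocks=6, max_steps=10)
```

The batch that fits at admission time stops fitting a few steps later, which is why a scheduler that ignores future response length cannot tell how close an instance is to preempting work.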

Furthermore, auto-provisioning – the dynamic scaling of resources to meet demand – is also complicated. When new instances are added, existing heavily loaded instances might continue to process requests, leading to load imbalances and resource wastage, a phenomenon similar to ‘cold starts’ in serverless computing.

How Block Works

Block addresses these challenges by integrating two key insights: the ability to accurately predict response lengths and the feasibility of simulating LLM inference performance. The framework consists of four main services:

Query Length Tagger: This is the entry point for requests. It uses a lightweight LLM-based regression model to estimate the anticipated length of the response based on the input prompt. This prediction is crucial for proactive scheduling.
Predictor: Running as a sidecar service on each LLM instance, the Predictor simulates and forecasts key performance metrics, such as end-to-end latency or Time-To-First-Token (TTFT), for incoming requests. It adapts existing simulation frameworks like Vidur to provide real-time predictions.
Global Scheduler: This service is fully distributed and stateless, ensuring scalability. Instead of maintaining a centralized view of instance statuses, it queries the Predictor services on each instance to get real-time metrics and predictions. Based on these predictions, it dispatches requests to the most appropriate model instances to balance the load.
Inference Framework Backend: This is where the actual LLM execution and response generation happen, typically using frameworks like vLLM. Block is designed to be agnostic to the specific backend framework.

By estimating response lengths and simulating performance, Block can anticipate the resource demands and execution duration of requests, allowing it to make informed scheduling decisions. This predictive approach helps avoid situations where unexpectedly long responses overload a host or block subsequent requests.
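The dispatch idea can be sketched in a few lines. This is a toy model under stated assumptions, not Block's code: the class and function names are hypothetical, the predicted response length is simply given rather than produced by a tagger model, and a crude linear cost model stands in for a real inference simulator such as Vidur.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    predicted_response_tokens: int  # would come from the length tagger


@dataclass
class InstancePredictor:
    """Toy stand-in for a per-instance Predictor sidecar (hypothetical)."""
    name: str
    seconds_per_token: float
    pending_tokens: int = 0  # outstanding work already assigned here

    def predict_latency(self, req: Request) -> float:
        # Linear cost model (assumption): wait for the backlog,
        # then process this request's prompt and predicted response.
        work = self.pending_tokens + req.prompt_tokens + req.predicted_response_tokens
        return work * self.seconds_per_token

    def accept(self, req: Request) -> None:
        self.pending_tokens += req.prompt_tokens + req.predicted_response_tokens


def dispatch(req, predictors):
    """Query every predictor and send the request to the lowest prediction.

    The scheduler keeps no state of its own; all load information
    lives in the per-instance predictors.
    """
    best = min(predictors, key=lambda p: p.predict_latency(req))
    best.accept(req)
    return best.name


instances = [InstancePredictor("a", 0.01), InstancePredictor("b", 0.01)]
requests = [Request(100, 900), Request(50, 50), Request(50, 50)]
placements = [dispatch(r, instances) for r in requests]
```

In this example the first request is predicted to be long, so both short requests go to instance "b". Round-robin would have queued the third request behind the long one on instance "a".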

Performance and Impact

Evaluations conducted on a 12-GPU cluster using real-world datasets and the LLaMA2-7B model demonstrated Block’s superior performance. Block consistently outperformed widely-used heuristic schedulers across various metrics:

It boosted serving capacity by up to 16.7%.
It increased throughput by up to 4.4%.
It reduced average request latency by 19.9-45.8% and P99 tail latency (the latency that 99% of requests stay at or below) by 12.6-49.5%.
Notably, Time-To-First-Token (TTFT) saw even more dramatic reductions, with average TTFT decreasing by 88.1-97.0% and P99 TTFT by 78.6-94.5%.

Block also proved effective in auto-provisioning. By using predicted metrics to preemptively add instances when needed, it achieved smoother changes in cluster size, higher GPU utilization, and a 20.1% reduction in P99 latency compared to reactive provisioning strategies. The framework’s generality was also confirmed, showing consistent advantages even when backend configurations, models (like Qwen2-7B), or datasets (like BurstGPT) were varied.
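The difference between predictive and reactive provisioning can be illustrated with a small sketch. The scale-up rule, the latency target, and the function names below are assumptions made for this example; the article does not describe Block's actual provisioning policy at this level of detail.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def should_add_instance(latencies, slo_seconds, q=99):
    """Hypothetical rule: scale up when the q-th percentile exceeds the target."""
    return percentile(latencies, q) > slo_seconds


# Predicted latencies for queued requests: two are forecast to be slow.
predicted = [1.0] * 98 + [6.0, 7.0]
# Observed latencies of requests that have already finished: all fast so far.
observed = [1.0] * 100

predictive_decision = should_add_instance(predicted, slo_seconds=5.0)
reactive_decision = should_add_instance(observed, slo_seconds=5.0)
```

Here the predicted P99 is already above the 5-second target while the observed P99 is not, so a predictive policy would add an instance before the slow requests complete, whereas a reactive one would wait until the damage shows up in the measurements.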

The research paper, titled “Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling,” highlights how predictive scheduling can lead to more efficient, responsive, and scalable LLM serving systems. The code and data for Block are open-sourced, paving the way for future advancements in this critical area.
