Tesserae: A New Approach to Efficient GPU Cluster Scheduling for Deep Learning

TLDR: Tesserae is a novel GPU cluster scheduler for deep learning workloads that addresses the limitations of existing placement policies. It formulates job placement and migration as graph matching problems, enabling scalable and efficient solutions. Tesserae introduces new algorithms for minimizing job migrations and maximizing job packing throughput, including optimizing parallelism strategies for large language models. Experimental results show significant improvements in job completion time, makespan, and fairness, while demonstrating strong adaptability and scalability for large-scale deep learning clusters.

Deep learning (DL) models are at the heart of modern data centers, and ensuring their efficient training is a top priority. A critical aspect of this efficiency lies in how jobs are placed on powerful GPU clusters. Traditionally, schedulers have relied on either simple, ad-hoc rules or complex optimization problems to decide where to run these demanding workloads. However, both approaches have significant drawbacks: ad-hoc rules often lead to suboptimal performance, while complex optimizations struggle to scale as clusters grow larger and the number of jobs increases.

Enter Tesserae, a novel approach designed to overcome these limitations. Researchers at the University of Wisconsin-Madison, Song Bian, Saurabh Agarwal, Md. Tareq Mahmood, and Shivaram Venkataraman, developed Tesserae based on a key insight: many deep learning job placement challenges can be elegantly framed as graph matching problems. This mathematical formulation allows for efficient solutions using well-established algorithms, leading to a more scalable and effective GPU cluster scheduler.

Minimizing Job Migrations for Smoother Operations

One of the hidden costs in GPU cluster management is job migration. When a job moves from one set of GPUs to another between scheduling rounds, it incurs overhead that can slow down overall progress. Tesserae introduces a migration algorithm that reduces these disruptions. It models the current placement and the next round's placement plan as a graph, then solves a matching problem to decide which GPUs each job in the new plan should occupy, so that jobs move only when necessary. Fewer migrations help maintain high throughput and reduce the time jobs spend waiting or relocating.
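
The idea can be illustrated with a minimal sketch. It is not Tesserae's actual algorithm: it assumes machines are interchangeable, counts one unit of cost per job that is not already on its target machine, and uses exhaustive search where a real scheduler would use a polynomial-time assignment solver. The function names and the three-machine input are hypothetical.

```python
from itertools import permutations


def jobs_to_move(on_machine_now, planned_for_slot):
    """Count planned jobs that are not already running on this machine."""
    return len(set(planned_for_slot) - set(on_machine_now))


def match_plan_to_machines(current, plan):
    """Map each slot of the new plan to a distinct physical machine.

    current[m] is the set of jobs on machine m in this round.
    plan[s] is the set of jobs that slot s should host in the next round.
    Returns (mapping, cost), where mapping[s] is the machine chosen for slot s
    and cost is the total number of jobs that are not already in place.
    """
    if len(current) != len(plan):
        raise ValueError("expected one plan slot per machine")
    n = len(current)
    best_mapping, best_cost = None, None
    # Exhaustive search over all slot-to-machine assignments (toy sizes only).
    for perm in permutations(range(n)):
        cost = sum(jobs_to_move(current[perm[s]], plan[s]) for s in range(n))
        if best_cost is None or cost < best_cost:
            best_mapping, best_cost = list(perm), cost
    return best_mapping, best_cost


# Hypothetical cluster: three machines, job E is new in the next round.
current = [{"A", "B"}, {"C"}, {"D"}]
plan = [{"C", "E"}, {"D"}, {"A", "B"}]

mapping, cost = match_plan_to_machines(current, plan)
# Cost of applying the plan slot by slot without any matching.
naive_cost = sum(jobs_to_move(current[s], plan[s]) for s in range(len(plan)))
print(mapping, cost, naive_cost)
```

In this toy input, applying the plan slot by slot would leave five jobs out of place, while the matched assignment keeps A, B, C, and D where they are and only has to place the new job E.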

Efficient Packing for Maximized GPU Utilization

Another core component of Tesserae is its efficient job packing policy. Packing involves running multiple deep learning jobs concurrently on the same GPUs to maximize resource utilization. Tesserae transforms this into a maximum weighted bipartite graph matching problem. In this graph, jobs already running are matched with jobs waiting to be placed, and the ‘weight’ of a potential match represents the combined throughput (performance) of those jobs when packed together. By solving this problem, Tesserae ensures that jobs are packed in a way that maximizes the total cluster throughput.
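
A small sketch of this matching formulation follows. It is an illustration under stated assumptions rather than Tesserae's implementation: the job names and throughput weights are made up, a missing edge means the pair is assumed not to fit on the same GPUs, and a brute-force search stands in for a proper maximum weighted bipartite matching solver.

```python
def best_packing(running, pending, weight):
    """Return (total_weight, pairs) for a maximum weight bipartite matching.

    running, pending: lists of job names (the two sides of the graph).
    weight[(r, p)]: combined throughput if running job r is packed with
    pending job p. Pairs absent from the dict cannot be packed together.
    """
    best_total, best_pairs = 0.0, {}

    def search(i, used, total, pairs):
        nonlocal best_total, best_pairs
        if i == len(running):
            if total > best_total:
                best_total, best_pairs = total, dict(pairs)
            return
        r = running[i]
        search(i + 1, used, total, pairs)  # option 1: leave r unpacked
        for p in pending:  # option 2: pack r with any unused partner
            if p not in used and (r, p) in weight:
                pairs[r] = p
                search(i + 1, used | {p}, total + weight[(r, p)], pairs)
                del pairs[r]

    search(0, frozenset(), 0.0, {})
    return best_total, best_pairs


# Hypothetical profiled weights (normalized combined throughput).
running = ["r1", "r2"]
pending = ["p1", "p2", "p3"]
weight = {
    ("r1", "p1"): 1.8,
    ("r1", "p2"): 1.7,
    ("r2", "p1"): 1.7,
    ("r2", "p2"): 1.0,
    ("r2", "p3"): 0.9,
}

total, pairs = best_packing(running, pending, weight)
print(total, pairs)
```

With these numbers, greedily taking the heaviest edge first (r1 with p1 at 1.8) leaves r2 with at best p2 for a total of 2.8, whereas the matching pairs r1 with p2 and r2 with p1 for a total of 3.4. That gap is the reason to solve the matching problem as a whole instead of pairing jobs one at a time.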

A notable feature of Tesserae’s packing policy is its ability to consider different parallelism strategies for large language models. These models can be trained using various techniques (like data parallelism or pipeline parallelism), and the choice can significantly impact performance, especially when packed with other jobs. Tesserae intelligently selects the best parallelism strategy to further boost combined throughput and prevent issues like out-of-memory errors.
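
One way to picture this selection step is sketched below. It assumes, purely for illustration, that each candidate strategy has a profiled memory footprint and a profiled combined throughput when packed with a given partner job. The `Profile` type, the `pick_strategy` helper, and all numbers are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Profile:
    strategy: str               # e.g. "data_parallel" or "pipeline_parallel"
    llm_mem_gb: float           # assumed per-GPU memory of the LLM job
    combined_throughput: float  # assumed throughput when packed with the partner


def pick_strategy(
    profiles: List[Profile], partner_mem_gb: float, gpu_mem_gb: float
) -> Optional[Profile]:
    """Pick the highest-throughput strategy whose memory fits next to the partner.

    Returns None when every strategy would exceed GPU memory, which in a
    matching formulation would correspond to having no edge between the jobs.
    """
    feasible = [p for p in profiles if p.llm_mem_gb + partner_mem_gb <= gpu_mem_gb]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.combined_throughput)


# Hypothetical profiles for one LLM job on 80 GB GPUs.
profiles = [
    Profile("data_parallel", llm_mem_gb=60.0, combined_throughput=1.9),
    Profile("pipeline_parallel", llm_mem_gb=35.0, combined_throughput=1.6),
]

small_partner = pick_strategy(profiles, partner_mem_gb=15.0, gpu_mem_gb=80.0)
large_partner = pick_strategy(profiles, partner_mem_gb=30.0, gpu_mem_gb=80.0)
too_large = pick_strategy(profiles, partner_mem_gb=50.0, gpu_mem_gb=80.0)
print(small_partner, large_partner, too_large)
```

In this sketch the faster strategy wins when the partner is small, the more memory-frugal strategy takes over once the faster one would run out of memory, and the pair is rejected entirely when neither fits.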

Real-World Impact and Scalability

The effectiveness of Tesserae has been demonstrated through experiments on both physical GPU clusters and large-scale simulations. Compared to existing schedulers like Tiresias and Gavel, Tesserae improves average job completion time (JCT) by up to 1.62x and makespan (the total time to complete all jobs) by up to 1.15x. It also improves fairness metrics, ensuring a more equitable distribution of resources among jobs.

Crucially, Tesserae is designed for adaptability and scalability. It can seamlessly adjust to different hardware configurations, such as varying GPU types, without requiring manual tuning. Its modular design allows it to be integrated with various existing scheduling policies, making it a versatile solution for diverse cluster environments. Furthermore, Tesserae proves highly scalable, capable of making placement decisions for clusters with thousands of GPUs and thousands of active jobs within seconds, a significant improvement over prior optimization-based methods that struggle with increasing scale.

This research marks a significant step forward in deep learning cluster scheduling, offering a principled, efficient, and scalable framework for managing complex workloads. For more in-depth information, see the full research paper.
