TLDR: Kant is a unified scheduling system for large-scale AI clusters, developed by ZTE Corporation. It addresses challenges in managing diverse AI workloads (training and inference) across heterogeneous GPUs. Kant improves resource utilization, reduces fragmentation, and enhances job response times through strategies like Backfill, Enhanced Binpack, and Topology-Aware scheduling. Experimental results show significant performance gains in both large-scale training and small-scale inference environments, making it a robust solution for modern AI infrastructure.
As the world of Artificial Intelligence continues its rapid expansion, particularly with the widespread adoption of large language models (LLMs), the demand for powerful and efficient computing infrastructure has skyrocketed. AI clusters, now scaling from hundreds to tens of thousands of GPUs, face immense challenges in managing resources effectively. Traditional scheduling systems often struggle to balance resource utilization, scheduling efficiency, and the quality of service for diverse AI workloads.
Addressing these critical issues, ZTE Corporation has introduced Kant, an innovative and unified scheduling platform designed specifically for large-scale AI container clusters. Kant is engineered to co-schedule both training and inference jobs, ensuring optimal performance and resource management in complex, heterogeneous environments. You can read the full research paper here: Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters.
The Evolving Landscape of AI Clusters and Their Challenges
Modern AI clusters are characterized by several key traits that pose significant hurdles for resource scheduling:
- Massive Scalability: Clusters are growing to unprecedented sizes, demanding systems that can handle thousands of GPUs.
- GPU Heterogeneity: Different GPU models with varying performance capabilities create complex resource pools.
- Tenant Diversity: Multiple users with distinct resource needs require fair allocation and isolation.
- Varied Job Sizes: While most jobs are small, a few large-scale training tasks consume the majority of GPU computing time.
- Diverse Task Types: The system must support LLM distributed training (efficiency-focused), inference services (low latency, high availability), and development tasks (flexibility).
Existing scheduling solutions fall short on both sides: HPC schedulers such as SLURM lack modern container orchestration capabilities, while cloud-native schedulers built on Kubernetes often suffer from low resource utilization and high latency when handling large-scale AI workloads. Kant aims to bridge this gap with a unified, high-performance solution.
Kant’s Unified Architecture: A Dual-Component Approach
The Kant system is built on Kubernetes and features a distributed, unified architecture centered around two core components:
- QSCH (Queue-based Scheduler): This module manages job queuing, admission control, and preemption strategies. It ensures fair scheduling across multiple tenants and task types, preventing resource starvation and optimizing job flow.
- RSCH (Resource-aware Scheduler): Focused on fine-grained resource allocation, RSCH supports various scheduling strategies tailored for diverse AI tasks. It can be deployed in multiple instances to achieve high scheduling throughput in large clusters.
This decoupled design improves system scalability and efficiency while balancing resource utilization, scheduling latency, and user experience.
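To make the split concrete, here is a minimal sketch of a queue-side component handling admission and priority ordering, and a resource-side component handling placement. All class and field names are illustrative assumptions for this post, not Kant's actual APIs:

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Job:
    priority: int                      # lower value = scheduled first
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class QSCH:
    """Queue-side scheduler: admission control plus priority ordering (illustrative)."""
    def __init__(self, tenant_quota):
        self.queue = []
        self.tenant_quota = tenant_quota   # max GPUs a tenant may hold

    def admit(self, job, tenant_usage):
        # Admission control: reject jobs that would exceed the tenant's quota.
        if tenant_usage + job.gpus > self.tenant_quota:
            return False
        heapq.heappush(self.queue, job)
        return True

    def next_job(self):
        return heapq.heappop(self.queue) if self.queue else None

class RSCH:
    """Resource-side scheduler: fine-grained placement onto nodes (illustrative)."""
    def __init__(self, free_gpus_per_node):
        self.free = dict(free_gpus_per_node)

    def place(self, job):
        # Binpack-style choice: the feasible node that would be left with
        # the fewest free GPUs, consolidating load and limiting fragmentation.
        candidates = [n for n, f in self.free.items() if f >= job.gpus]
        if not candidates:
            return None
        node = min(candidates, key=lambda n: self.free[n])
        self.free[node] -= job.gpus
        return node
```

The point of the decoupling is visible even in this toy: QSCH decides *whether and when* a job enters scheduling, RSCH decides *where* it lands, so multiple RSCH instances can be run in parallel for throughput without touching admission logic.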
Key Strategies for Enhanced Performance
Kant incorporates several advanced scheduling strategies:
- Multi-tenant Fair Scheduling: Through admission, queuing, and preemption controls, Kant ensures fair resource distribution and isolation. The Backfill queuing strategy, for instance, allows smaller jobs to utilize idle resources while a larger job waits, improving overall throughput without indefinitely postponing large tasks.
- Efficient GPU Utilization: Fine-grained GPU scheduling, Gang scheduling (all-or-nothing resource allocation for distributed jobs), and Enhanced Binpack (E-Binpack) strategies maximize GPU utilization by consolidating workloads and reducing fragmentation. E-Binpack intelligently co-locates job replicas to minimize communication overhead.
- Meeting SLA Requirements: For inference services, Enhanced Spread (E-Spread) distributes replicas across nodes for fault tolerance and high availability. Kant also supports dedicated inference zones, preserving full-node resources for large-scale distributed inference tasks.
- Topology-Aware Scheduling: Kant intelligently places workloads based on the communication quality between GPUs, both within a node (NVLink, PCIe) and across nodes (RDMA network hierarchy), significantly reducing communication overhead for intensive tasks like distributed training.
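Of these strategies, Backfill is the easiest to picture in code. The toy sketch below contrasts strict FIFO (where one large blocked job stalls everything behind it) with a simplified backfill pass; a production backfill scheduler would additionally use runtime estimates to guarantee the waiting head job is never postponed indefinitely, which this version omits:

```python
def fifo_schedule(queue, free_gpus):
    """Strict FIFO: stop at the first job that does not fit (head-of-line blocking)."""
    started = []
    for name, gpus in queue:
        if gpus > free_gpus:
            break                      # head job blocks; nothing behind it runs
        started.append(name)
        free_gpus -= gpus
    return started

def backfill_schedule(queue, free_gpus):
    """Backfill: the blocked head keeps its queue position, but smaller jobs
    behind it may start immediately on the otherwise-idle GPUs."""
    started = []
    for name, gpus in queue:
        if gpus <= free_gpus:          # fits now, either in order or as backfill
            started.append(name)
            free_gpus -= gpus
    return started
```

With 8 free GPUs and a queue `[("big", 16), ("a", 2), ("b", 4)]`, FIFO starts nothing while `backfill_schedule` starts `a` and `b`, which is exactly the throughput gain the Backfill strategy targets.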
Optimizing for Scale and Speed
To handle the demands of massive AI clusters, Kant includes performance optimization mechanisms such as splitting heterogeneous clusters into GPU Type-based Node Pools, hierarchical scheduling with Node Grouping (abstracting network topologies into NodeNetGroups), and memory optimization techniques to reduce CPU and memory overhead during scheduling cycles.
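As a rough illustration of the node-pool and NodeNetGroup ideas (the field names here are assumptions for this post, not Kant's actual data model), partitioning nodes first by GPU type and then by network group shrinks the candidate set each scheduling cycle has to scan:

```python
from collections import defaultdict

def build_node_pools(nodes):
    """Group nodes into per-GPU-type pools, then into network groups within
    each pool, so a scheduling cycle only considers compatible, nearby nodes."""
    pools = defaultdict(lambda: defaultdict(list))
    for node in nodes:
        pools[node["gpu_type"]][node["netgroup"]].append(node["name"])
    # Convert nested defaultdicts to plain dicts for readability.
    return {gpu_type: dict(groups) for gpu_type, groups in pools.items()}
```

A scheduler that only ever walks one `(gpu_type, netgroup)` bucket per cycle does work proportional to the bucket size rather than the whole cluster, which is the essence of the hierarchical optimization described above.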
Measuring Success: Kant’s Performance Metrics
The Kant system defines a comprehensive set of metrics to evaluate scheduling performance:
- GPU Allocation Ratio (GAR): Measures the fraction of the cluster's GPUs that are currently allocated.
- Scheduling Occupation Ratio (SOR): Reflects the efficiency of GPU resource utilization over time.
- GPU Node Fragmentation Ratio (GFR): Indicates the proportion of partially occupied nodes.
- Job Waiting Time Distribution (JWTD): Tracks the latency between job submission and scheduling.
- Job Training Time Estimation Distribution (JTTED): Assesses how closely scheduling aligns with optimal communication topology for training jobs.
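The snapshot-based metrics are straightforward to compute from per-node allocation data. A minimal sketch, assuming each node reports its total and allocated GPU counts (the paper's exact formulas may differ in detail):

```python
def gpu_allocation_ratio(nodes):
    """GAR: allocated GPUs as a fraction of all GPUs in the cluster."""
    total = sum(n["total"] for n in nodes)
    used = sum(n["used"] for n in nodes)
    return used / total if total else 0.0

def gpu_fragmentation_ratio(nodes):
    """GFR: fraction of nodes that are partially occupied, i.e. some but
    not all of their GPUs are in use (neither fully free nor fully packed)."""
    partial = sum(1 for n in nodes if 0 < n["used"] < n["total"])
    return partial / len(nodes) if nodes else 0.0
```

A high GAR with a low GFR is the regime Binpack-style consolidation aims for: most GPUs allocated, with allocations packed onto as few partially-filled nodes as possible.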
Experimental Validation: Real-World Performance
Experiments conducted on both large-scale training clusters (8,000 GPUs) and small-scale inference clusters (hundreds of GPUs) demonstrated Kant’s superior performance. In training scenarios, Backfill and E-Binpack strategies significantly improved GAR and SOR, reduced GFR, shortened JWTD, and brought JTTED closer to optimal, indicating reduced communication overhead. For inference clusters, Kant maintained high GAR, reasonable GFR, and provided rapid response and high availability, while effectively managing GPU quotas in multi-tenant, heterogeneous environments.
Conclusion and Future Outlook
The Kant system represents a significant advancement in AI cluster scheduling, offering an efficient, unified platform that addresses the complex demands of modern AI workloads. By leveraging intelligent scheduling strategies and robust architectural design, Kant delivers high resource utilization, low latency, and strong service quality. Currently deployed in multiple AI container clusters, Kant stably supports large-scale intelligent computing tasks. Future work aims to further improve response speed and fault tolerance, explore cross-cluster scheduling, and deepen collaboration with AI training frameworks to continuously push the boundaries of AI cluster efficiency.