TLDR: DCcluster-Opt is an open-source simulation benchmark for optimizing AI workload management in globally distributed data centers. It integrates real-world data on environmental factors, detailed data center physics, and network dynamics to provide a high-fidelity testbed. The platform allows researchers to develop and evaluate multi-objective scheduling algorithms that balance carbon emissions, energy costs, service level agreements, and water use, accelerating the development of sustainable computing solutions.
The rapid expansion of Artificial Intelligence (AI) is leading to a significant increase in energy consumption and carbon emissions from data centers worldwide. Managing these vast, globally distributed computing systems sustainably is a complex challenge, often hampered by the lack of realistic tools to test new solutions.
A new research paper introduces DCcluster-Opt, an open-source, high-fidelity simulation benchmark designed to address this need. The platform lets researchers develop and evaluate strategies for managing AI workloads that balance performance against sustainability goals such as reducing carbon emissions, energy costs, and water usage.
Bridging the Gap in Sustainable Computing
Traditional research in data center management often relies on simplified models, which don’t fully capture the intricate interplay of real-world factors. DCcluster-Opt stands out by integrating a comprehensive set of dynamic elements:
- Time-Varying Environmental Factors: It considers real-time grid carbon intensity, fluctuating electricity prices, and local weather conditions across 20 global regions.
- Detailed Data Center Physics: The simulation models the energy consumption of IT components (CPUs, GPUs, memory) and the crucial HVAC (heating, ventilation, and air conditioning) systems.
- Geo-Distributed Network Dynamics: It accounts for network latency and data transmission costs between different locations.
This benchmark presents a challenging problem: a central agent must dynamically decide, for each incoming AI task, whether to assign it to a particular data center or defer it. These decisions are made across a configurable cluster of data centers, with the goal of optimizing multiple objectives simultaneously.
How DCcluster-Opt Works
The simulation progresses in discrete 15-minute steps, mirroring real-world operational cadences and data availability. At each step, the agent receives information about pending tasks and the current state of all data centers. It then decides to either assign a task to a specific data center or defer it. The environment then simulates the consequences of these actions, including network delays for remote assignments, task execution, and updates to energy consumption and emissions. A modular reward system allows researchers to define and weigh different objectives, such as minimizing carbon emissions, energy costs, service level agreement (SLA) violations, and water use.
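The modular reward described above can be pictured as a weighted sum of per-step penalties. The following is a minimal sketch under that assumption, not the benchmark's actual reward code: the names `StepMetrics` and `weighted_reward`, the weight keys, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict

STEP_MINUTES = 15  # step length stated in the article


@dataclass
class StepMetrics:
    """Hypothetical per-step outcomes reported by the environment."""
    carbon_kg: float
    energy_cost_usd: float
    sla_violations: int
    water_l: float


def weighted_reward(m: StepMetrics, weights: Dict[str, float]) -> float:
    """Return the negative weighted sum of the step's penalties.

    Objectives missing from `weights` get weight 0, so a researcher can
    switch individual objectives on or off. Higher reward is better.
    """
    penalty = (
        weights.get("carbon", 0.0) * m.carbon_kg
        + weights.get("cost", 0.0) * m.energy_cost_usd
        + weights.get("sla", 0.0) * m.sla_violations
        + weights.get("water", 0.0) * m.water_l
    )
    return -penalty


# Illustrative numbers only; they are not taken from the paper.
metrics = StepMetrics(carbon_kg=10.0, energy_cost_usd=4.0,
                      sla_violations=1, water_l=20.0)
carbon_first = weighted_reward(metrics, {"carbon": 1.0, "sla": 2.0})
cost_first = weighted_reward(metrics, {"cost": 1.0, "sla": 2.0})
print(carbon_first, cost_first)
```

The same step outcome scores differently under the two weightings, which is the sense in which the choice of weights defines the scheduling problem the agent is asked to solve.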
Key Features for Realistic Simulation
DCcluster-Opt’s realism is built upon several pillars:
- Physics-Informed Datacenter Model: Each data center simulates IT power and HVAC system energy, with performance adapting to IT load and ambient weather. It even supports advanced components like Heat Recovery Units (HRUs) that reuse waste heat.
- Transmission-Aware Network Model: Moving tasks between data centers incurs monetary costs, energy consumption, carbon emissions, and network delays, all based on empirical data.
- Dynamic Task & Environment Model: It uses real-world AI workload traces (like the Alibaba AI workload trace), along with real-time data streams for electricity prices, grid carbon intensity, and weather.
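To make the transmission-aware idea concrete, here is a rough sketch of how the four overheads of moving a task (money, energy, carbon, delay) could be derived from the amount of data transferred. This is an assumption-laden illustration, not the benchmark's model: the `Link` fields, the `transfer_overheads` function, and every coefficient are hypothetical, whereas the article says the real values come from empirical data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Link:
    """Hypothetical description of a route between two data centers."""
    usd_per_gb: float       # monetary transfer price
    kwh_per_gb: float       # network energy per gigabyte moved
    bandwidth_gbps: float   # usable bandwidth in gigabits per second
    latency_s: float        # fixed propagation latency in seconds


def transfer_overheads(data_gb: float, link: Link,
                       carbon_g_per_kwh: float) -> Dict[str, float]:
    """Estimate cost, energy, carbon, and delay for moving `data_gb`."""
    energy_kwh = data_gb * link.kwh_per_gb
    return {
        "cost_usd": data_gb * link.usd_per_gb,
        "energy_kwh": energy_kwh,
        # Carbon follows from energy and the grid's carbon intensity.
        "carbon_g": energy_kwh * carbon_g_per_kwh,
        # Gigabytes are converted to gigabits (x8) before dividing
        # by the link bandwidth.
        "delay_s": link.latency_s + (data_gb * 8.0) / link.bandwidth_gbps,
    }


# Placeholder coefficients, not measured values.
link = Link(usd_per_gb=0.05, kwh_per_gb=0.01,
            bandwidth_gbps=10.0, latency_s=0.1)
overheads = transfer_overheads(100.0, link, carbon_g_per_kwh=400.0)
print(overheads)
```

A scheduler can weigh these overheads against the savings available at the remote site, so a cleaner or cheaper data center is not automatically the better choice for a data-heavy task.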
Evaluating Scheduling Strategies
The research demonstrates DCcluster-Opt’s utility by comparing various scheduling strategies, including simple rule-based controllers (like always choosing the lowest carbon or lowest price data center) and advanced reinforcement learning (RL) agents. The results highlight the complex trade-offs involved; for instance, a strategy focused solely on the lowest electricity price might lead to higher carbon emissions. The study shows that RL agents can learn policies that effectively balance multiple objectives, achieving lower total costs and competitive CO2 emissions, though sometimes at the expense of a slightly higher SLA violation rate due to task deferral.
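The rule-based baselines mentioned above are simple enough to sketch. The snippet below is a hypothetical illustration, not the benchmark's API: `SiteState`, `pick_site`, and the site values are invented, and the rule of deferring when no site has free capacity is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SiteState:
    """Hypothetical snapshot of one data center at the current step."""
    name: str
    carbon_g_per_kwh: float
    price_usd_per_mwh: float
    free_capacity: int


def pick_site(sites: List[SiteState],
              key: Callable[[SiteState], float]) -> Optional[str]:
    """Greedy rule: choose the site with free capacity that minimizes `key`.

    Returns None to signal that the task is deferred (assumed behavior
    when every site is full).
    """
    candidates = [s for s in sites if s.free_capacity > 0]
    if not candidates:
        return None
    return min(candidates, key=key).name


# Made-up values chosen so that the two rules disagree.
sites = [
    SiteState("dc_a", carbon_g_per_kwh=450.0, price_usd_per_mwh=30.0, free_capacity=4),
    SiteState("dc_b", carbon_g_per_kwh=120.0, price_usd_per_mwh=65.0, free_capacity=2),
    SiteState("dc_c", carbon_g_per_kwh=80.0, price_usd_per_mwh=90.0, free_capacity=0),
]
lowest_carbon = pick_site(sites, key=lambda s: s.carbon_g_per_kwh)
lowest_price = pick_site(sites, key=lambda s: s.price_usd_per_mwh)
print(lowest_carbon, lowest_price)
```

With these values the price rule sends the task to the site with the dirtiest grid, while the carbon rule pays more than double the electricity price. That is the kind of single-objective trade-off a learned policy is meant to balance.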
Furthermore, the benchmark allows for evaluating the impact of advanced local data center controls. Integrating an RL-based HVAC controller, for example, significantly reduced total energy and CO2 emissions, and adding a Heat Recovery Unit further enhanced these savings, even reducing water usage.
Towards Trustworthy AI Controllers
The paper also explores the development of an agentic AI controller, which uses large language models (LLMs) to create more interpretable and auditable scheduling decisions. This system mimics a human operations team, with specialized agents for sensing, analyzing, planning, validating, acting, and monitoring. This approach aims to build trustworthy autonomous systems that can justify their decisions, adapt to changing objectives, and scale to different numbers of data centers without extensive retraining.
DCcluster-Opt provides a robust, configurable, and accessible testbed intended to accelerate the development and validation of next-generation sustainable computing solutions for geo-distributed data centers. For more details, see the full research paper.


