TLDR: Argonne National Laboratory’s Aurora supercomputer demonstrated exceptional performance in the MLPerf Storage Benchmark v2.0, with its DAOS and Lustre file systems delivering the throughput needed for large-scale AI model training and fast checkpointing in the exascale era.
MLCommons recently unveiled the results of its MLPerf Storage Benchmark v2.0, an evaluation designed to assess how storage systems perform under the demanding I/O requirements of large-scale machine learning workloads. Among the notable submissions, the U.S. Department of Energy’s (DOE) Argonne National Laboratory showcased the capabilities of its Aurora supercomputer, highlighting the strengths of its DAOS and Lustre file systems in supporting modern AI training.
Training colossal AI models such as LLaMA3-405B and LLaMA3-1T requires rapid, reliable checkpointing: periodically saving the model’s state to guard against hardware or software failures. These operations must write massive volumes of data within brief windows, because any delay leaves expensive compute sitting idle and drags down overall efficiency.
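To make that cost concrete, here is a minimal, hypothetical sketch of periodic checkpointing in a PyTorch-style training loop; the model, optimizer, and `save_interval` are illustrative stand-ins, not Argonne’s actual training code. The key point is that the synchronous `torch.save` call blocks the loop, so the longer the write takes, the longer the accelerators sit idle.

```python
import torch

def train(model, optimizer, data_loader, save_interval=1000,
          ckpt_path="checkpoint.pt"):
    """Toy training loop with periodic, synchronous checkpointing."""
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

        # Periodically persist the full training state. While this
        # write is in flight, the accelerators do no useful work --
        # which is why checkpoint write bandwidth matters at scale.
        if step % save_interval == 0:
            torch.save(
                {
                    "step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                },
                ckpt_path,
            )
```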
The MLPerf Storage Benchmark utilizes the Deep Learning I/O (DLIO) benchmark to accurately simulate the I/O operations that occur during AI training. The new AI checkpointing workloads, a key addition in the v2.0 release, were developed by Huihuo Zheng, a computer scientist at the Argonne Leadership Computing Facility (ALCF). Zheng is also a core developer of DLIO and co-chair of the MLPerf Storage working group.
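DLIO itself is a configuration-driven Python framework, but the core of a checkpoint workload is easy to picture: time a large, mostly sequential write and report the achieved bandwidth. The toy sketch below illustrates that measurement in miniature; the file path, sizes, and single-process setup are illustrative assumptions, not DLIO’s actual interface, and real benchmark runs shard the write across many processes and storage servers.

```python
import os
import time

def measure_checkpoint_write(path: str, total_bytes: int,
                             block_bytes: int = 64 * 1024 * 1024) -> float:
    """Write `total_bytes` of dummy checkpoint data and return GB/s.

    Mimics, in miniature, what an I/O benchmark does for a checkpoint
    workload: a large sequential write whose duration directly
    determines how long training sits idle.
    """
    block = os.urandom(block_bytes)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += block_bytes
        f.flush()
        os.fsync(f.fileno())  # count the time to reach stable storage
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    # Hypothetical 4 GiB single-process run.
    gbps = measure_checkpoint_write("ckpt.bin", 4 * 1024**3)
    print(f"achieved write throughput: {gbps:.2f} GB/s")
```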
To underscore Aurora’s prowess in handling such challenges, the ALCF, a DOE Office of Science user facility, submitted performance data for two distinct storage systems:
- DAOS (Distributed Asynchronous Object Storage): This high-performance, open-source object store reached nearly 1 terabyte per second (TB/s) in write throughput and 600 gigabytes per second (GB/s) in read throughput, using a subset of just 128 of Aurora’s 1,024 DAOS servers. That was enough to complete a full checkpoint of the LLaMA3-405B model in under 10 seconds (see the sizing sketch after this list).
- Lustre: Powering the ALCF’s 100 petabyte (PB) Flare file system, Lustre delivered sustained read and write throughputs of 400 to 600 GB/s, demonstrating that it can operate effectively near peak capacity even under the most demanding workloads.
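The DAOS numbers line up with a rough back-of-the-envelope sizing. The submission does not spell out the checkpoint’s exact contents, so the byte counts below are assumptions: bf16 weights plus fp32 Adam optimizer state come to roughly 14 bytes per parameter, or about 5.7 TB for a 405-billion-parameter model, which a ~1 TB/s write path drains in roughly 6 seconds, consistent with the reported under-10-second figure.

```python
# Back-of-the-envelope checkpoint sizing for LLaMA3-405B.
# Assumed layout (not from the Argonne submission): bf16 weights
# (2 bytes/param) plus fp32 Adam state -- master weights, first and
# second moments (3 x 4 bytes/param) -- for ~14 bytes per parameter.
params = 405e9
bytes_per_param = 2 + 3 * 4          # bf16 weights + fp32 Adam state
ckpt_bytes = params * bytes_per_param

write_rate = 1000e9                  # ~1 TB/s, the reported DAOS peak
seconds = ckpt_bytes / write_rate

print(f"checkpoint size: {ckpt_bytes / 1e12:.1f} TB")   # ~5.7 TB
print(f"time at 1 TB/s:  {seconds:.1f} s")              # ~5.7 s
```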
Zheng emphasized the significance of these results: “Checkpointing plays a crucial role in protecting weeks of training progress when working with today’s largest models. The MLPerf benchmark provides a rigorous and realistic way to measure how storage systems handle these demands. Our results show that Aurora’s architecture delivers the performance and throughput required for large-scale AI training in the exascale era.”
Through its participation in MLPerf Storage v2.0, the ALCF continues to advance the benchmarks and technologies needed for scalable, fault-tolerant AI training on cutting-edge high-performance computing systems.


