TLDR: Argonne National Laboratory’s Aurora supercomputer demonstrated exceptional performance in the MLPerf Storage Benchmark v2.0, with its DAOS and Lustre file systems delivering the throughput needed for large-scale AI model training and fast checkpointing in the exascale era.
MLCommons recently unveiled the results of its MLPerf Storage Benchmark v2.0, an evaluation designed to assess how storage systems perform under the demanding I/O requirements of large-scale machine learning workloads. Among the notable submissions, the U.S. Department of Energy’s (DOE) Argonne National Laboratory showcased the capabilities of its Aurora supercomputer, highlighting the strengths of its DAOS and Lustre file systems in supporting modern AI training.
Training colossal AI models such as LLaMA3-405B and LLaMA3-1T requires rapid, reliable checkpointing: periodically saving the model’s state to guard against hardware or software failures. These operations must write massive volumes of data within brief windows, because any delay leaves expensive compute sitting idle and drags down overall efficiency.
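To make that cost concrete, here is a minimal, hypothetical sketch of periodic checkpointing in a PyTorch-style training loop; the model, optimizer, and `save_interval` are illustrative stand-ins, not Argonne’s actual training code. The key point is that the synchronous `torch.save` call blocks the loop, so the longer the write takes, the longer the accelerators sit idle.

```python
import torch

def train(model, optimizer, data_loader, save_interval=1000,
          ckpt_path="checkpoint.pt"):
    """Toy training loop with periodic, synchronous checkpointing."""
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

        # Periodically persist the full training state. While this
        # write is in flight, the accelerators do no useful work --
        # which is why checkpoint write bandwidth matters at scale.
        if step % save_interval == 0:
            torch.save(
                {
                    "step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                },
                ckpt_path,
            )
```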
The MLPerf Storage Benchmark utilizes the Deep Learning I/O (DLIO) benchmark to accurately simulate the I/O operations that occur during AI training. The new AI checkpointing workloads, a key addition in the v2.0 release, were developed by Huihuo Zheng, a computer scientist at the Argonne Leadership Computing Facility (ALCF). Zheng is also a core developer of DLIO and co-chair of the MLPerf Storage working group.
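DLIO itself is a configuration-driven Python framework, but the core of a checkpoint workload is easy to picture: time a large, mostly sequential write and report the achieved bandwidth. The toy sketch below illustrates that measurement in miniature; the file path, sizes, and single-process setup are illustrative assumptions, not DLIO’s actual interface, and real benchmark runs shard the write across many processes and storage servers.

```python
import os
import time

def measure_checkpoint_write(path: str, total_bytes: int,
                             block_bytes: int = 64 * 1024 * 1024) -> float:
    """Write `total_bytes` of dummy checkpoint data and return GB/s.

    Mimics, in miniature, what an I/O benchmark does for a checkpoint
    workload: a large sequential write whose duration directly
    determines how long training sits idle.
    """
    block = os.urandom(block_bytes)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += block_bytes
        f.flush()
        os.fsync(f.fileno())  # count the time to reach stable storage
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    # Hypothetical 4 GiB single-process run.
    gbps = measure_checkpoint_write("ckpt.bin", 4 * 1024**3)
    print(f"achieved write throughput: {gbps:.2f} GB/s")
```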
To underscore Aurora’s prowess in handling such challenges, the ALCF, a DOE Office of Science user facility, submitted performance data for two distinct storage systems:
- DAOS (Distributed Asynchronous Object Storage): This high-performance, open-source object store reached nearly 1 terabyte per second (TB/s) in write throughput and 600 gigabytes per second (GB/s) in read throughput, using a subset of just 128 of Aurora’s 1,024 DAOS servers. That was enough to complete a full checkpoint of the LLaMA3-405B model in under 10 seconds (see the sizing sketch after this list).
- Lustre: Powering the ALCF’s 100 petabyte (PB) Flare file system, Lustre delivered sustained read and write throughputs of 400 to 600 GB/s, demonstrating that it can operate effectively near peak capacity even under the most demanding workloads.
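The DAOS numbers line up with a rough back-of-the-envelope sizing. The submission does not spell out the checkpoint’s exact contents, so the byte counts below are assumptions: bf16 weights plus fp32 Adam optimizer state come to roughly 14 bytes per parameter, or about 5.7 TB for a 405-billion-parameter model, which a ~1 TB/s write path drains in roughly 6 seconds, consistent with the reported under-10-second figure.

```python
# Back-of-the-envelope checkpoint sizing for LLaMA3-405B.
# Assumed layout (not from the Argonne submission): bf16 weights
# (2 bytes/param) plus fp32 Adam state -- master weights, first and
# second moments (3 x 4 bytes/param) -- for ~14 bytes per parameter.
params = 405e9
bytes_per_param = 2 + 3 * 4          # bf16 weights + fp32 Adam state
ckpt_bytes = params * bytes_per_param

write_rate = 1000e9                  # ~1 TB/s, the reported DAOS peak
seconds = ckpt_bytes / write_rate

print(f"checkpoint size: {ckpt_bytes / 1e12:.1f} TB")   # ~5.7 TB
print(f"time at 1 TB/s:  {seconds:.1f} s")              # ~5.7 s
```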
Zheng emphasized the significance of these results: “Checkpointing plays a crucial role in protecting weeks of training progress when working with today’s largest models. The MLPerf benchmark provides a rigorous and realistic way to measure how storage systems handle these demands. Our results show that Aurora’s architecture delivers the performance and throughput required for large-scale AI training in the exascale era.”
Through its participation in MLPerf Storage v2.0, the ALCF continues to advance the benchmarks and technologies needed for scalable, fault-tolerant AI training on cutting-edge high-performance computing systems.


