AI-Powered Analytics for High-Performance Computing Operations: Introducing the EPIC Platform

TLDR: EPIC is an AI-driven platform designed to accelerate operational data analytics in High-Performance Computing (HPC) systems. It uses a hierarchical multi-agent architecture in which a top-level large language model orchestrates specialized low-level agents for information retrieval, descriptive analytics, and predictive analytics. This enables dynamic and iterative analysis of multi-modal data (text, images, tabular). Evaluations on the Frontier HPC system show that EPIC effectively handles complex queries, that fine-tuned smaller models achieve higher accuracy than larger foundation models on specific tasks, and that the platform delivers significant cost savings (up to 19x) compared to proprietary solutions.

High-Performance Computing (HPC) systems are the backbone of scientific discovery and complex simulations, generating vast amounts of operational data. However, extracting meaningful insights from this multi-terabyte data stream has traditionally been a challenging and time-consuming task. Conventional operational data analytics (ODA) methods often rely on static approaches, struggling to adapt to the rapidly evolving needs of diverse stakeholders and the dynamic nature of HPC environments.

To address these limitations, researchers at the National Center for Computational Sciences at Oak Ridge National Laboratory have introduced EPIC, an AI-driven platform designed to accelerate HPC operational data analytics. EPIC represents a new paradigm for interacting with and understanding the workings of exascale computing systems.

A Hierarchical Multi-Agent Approach

At its core, EPIC employs a hierarchical multi-agent architecture. Imagine a manager (a top-level large language model, or LLM) that oversees a team of specialized experts. This manager is responsible for processing complex user queries, reasoning through the problem, and synthesizing the final answers. To do so, it orchestrates three specialized low-level agents, each designed for a specific analytical task:

  • Information Retrieval (IR) Agent: This agent is the knowledge expert, capable of sifting through unstructured data like HPC user manuals, academic papers, and web content. It can retrieve relevant domain information, including text, tables, and images, providing a rich context for analysis.
  • Descriptive Analytics (DA) Agent: The DA agent focuses on understanding ‘what happened’ in the past. It interacts with large, structured telemetry databases, translating natural language queries into SQL commands to perform exploratory data analysis and generate insights from historical data.
  • Predictive Analytics (PA) Agent: Looking to the future, the PA agent handles tasks that estimate HPC job metrics, such as power utilization, energy consumption, and compute node temperature. It leverages regression models trained on past data to make informed predictions.
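To make the DA agent's role more concrete, here is a minimal sketch of the question-to-SQL step. It is an illustration under stated assumptions, not EPIC's implementation: the `node_power` table, its columns, and the `translate_to_sql` helper are hypothetical, and the helper returns a fixed query where a real agent would call a language model.

```python
import sqlite3


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM that converts a question into SQL.

    A real DA agent would prompt a model with the database schema; here a
    fixed query is returned so the sketch stays runnable.
    """
    return (
        "SELECT job_id, AVG(power_watts) AS avg_power "
        "FROM node_power GROUP BY job_id "
        "ORDER BY avg_power DESC LIMIT 1"
    )


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory telemetry table (made-up rows, not Frontier data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node_power (job_id TEXT, node TEXT, power_watts REAL)"
    )
    conn.executemany(
        "INSERT INTO node_power VALUES (?, ?, ?)",
        [
            ("job-a", "n1", 400.0),
            ("job-a", "n2", 600.0),
            ("job-b", "n1", 900.0),
            ("job-b", "n3", 700.0),
        ],
    )
    return conn


def run_descriptive_query(conn: sqlite3.Connection, question: str):
    """Translate the question, execute the SQL, and return both for inspection."""
    sql = translate_to_sql(question)
    rows = conn.execute(sql).fetchall()
    return sql, rows


if __name__ == "__main__":
    conn = build_demo_db()
    sql, rows = run_descriptive_query(
        conn, "Which job had the highest average node power?"
    )
    print(sql)
    print(rows)
```

Returning the generated SQL alongside the rows keeps the intermediate step visible, which matters when a model, rather than a person, wrote the query.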

This modular design allows EPIC to handle multi-modal data – including text, images, and tabular formats – dynamically and iteratively. It moves beyond static dashboards, offering an on-demand and interactive analytical environment that can adapt to continuously evolving inquiries.
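The manager-and-experts arrangement can be sketched in a few lines. This is a simplified illustration, not EPIC's code: the agent functions are placeholders, and the keyword rules in `plan` are a hypothetical substitute for the reasoning that the top-level LLM performs.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]


def ir_agent(task: str) -> str:
    """Placeholder for the Information Retrieval agent."""
    return f"[IR] retrieved documents for: {task}"


def da_agent(task: str) -> str:
    """Placeholder for the Descriptive Analytics agent."""
    return f"[DA] ran SQL analysis for: {task}"


def pa_agent(task: str) -> str:
    """Placeholder for the Predictive Analytics agent."""
    return f"[PA] predicted metric for: {task}"


AGENTS: Dict[str, Agent] = {"ir": ir_agent, "da": da_agent, "pa": pa_agent}


def plan(query: str) -> List[Tuple[str, str]]:
    """Hypothetical planner: keyword rules stand in for the top-level LLM."""
    q = query.lower()
    steps: List[Tuple[str, str]] = []
    if any(word in q for word in ("manual", "documentation", "paper")):
        steps.append(("ir", query))
    if any(word in q for word in ("average", "history", "historical", "last")):
        steps.append(("da", query))
    if any(word in q for word in ("predict", "estimate", "forecast")):
        steps.append(("pa", query))
    # Fall back to retrieval when no rule matches.
    return steps or [("ir", query)]


def answer(query: str) -> List[str]:
    """Dispatch each planned step to its agent and collect the outputs in order."""
    return [AGENTS[name](task) for name, task in plan(query)]


if __name__ == "__main__":
    for line in answer("What was the average power last week? Predict next week."):
        print(line)
```

Because each agent sits behind the same call signature, adding a new capability in this sketch means registering one more entry in `AGENTS` and teaching the planner when to use it.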

Key Advantages and Performance

EPIC has been evaluated on the Frontier HPC system, where it handled complex queries effectively. One notable result concerns descriptive analytics: fine-tuned smaller models within EPIC outperformed large, state-of-the-art foundation models, achieving up to 26% higher accuracy for specific tasks. This highlights a crucial finding: for specialized analytical tasks, smaller, purpose-built models can be more effective than their larger, general-purpose counterparts.

Beyond performance, EPIC also delivers substantial cost savings. By adopting a hybrid approach that combines powerful proprietary foundation models with local, open-weight models, the platform achieved 19x savings in LLM operational costs compared to purely proprietary solutions. This makes advanced HPC operational analytics more accessible and sustainable.
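As a rough illustration of how a hybrid setup lowers cost, the sketch below compares a proprietary-only bill with a mixed one. The cost model, the function name, and every number in it are hypothetical. None of them come from the EPIC evaluation, and the example does not attempt to reproduce the 19x figure.

```python
def hybrid_savings(
    total_tokens: float,
    local_fraction: float,
    proprietary_price: float,
    local_price: float,
) -> float:
    """Return the ratio of proprietary-only cost to hybrid cost.

    Assumed model (not from the source): cost is linear in tokens, and
    `local_fraction` of all tokens is served by a local open-weight model
    at `local_price` per token instead of `proprietary_price`.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    proprietary_only = total_tokens * proprietary_price
    hybrid = total_tokens * (
        (1.0 - local_fraction) * proprietary_price
        + local_fraction * local_price
    )
    if hybrid == 0:
        raise ValueError("hybrid cost is zero; the ratio is undefined")
    return proprietary_only / hybrid


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the shape of the calculation.
    ratio = hybrid_savings(
        total_tokens=1_000_000,
        local_fraction=0.5,
        proprietary_price=10.0,
        local_price=0.0,
    )
    print(f"savings factor: {ratio:.1f}x")
```

Under this simple model, the savings factor grows quickly as more of the workload moves to the cheaper local model, which is the intuition behind pairing a proprietary orchestrator with local low-level agents.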

The platform also features a multi-modal user interface: a chatbot-style web UI through which users interact with the system. The interface returns not only text responses but also dynamic tables and interactive plots, which users can customize to fit their needs. Crucially, the UI also displays the agent’s intermediate steps and tool calls, offering transparency into its reasoning process.

Looking Ahead

EPIC represents a significant step forward in leveraging generative AI for HPC operational data analytics. By automating reasoning, interaction with diverse data, and synthesis of results, it helps bridge the gap between raw data and actionable insights. The platform’s modular and extensible design is intended to let it evolve, integrating new capabilities and adapting to the future demands of exascale computing. For more details, refer to the full research paper.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
