TLDR: InferA is a multi-agent AI system designed to analyze extremely large cosmological simulation datasets, such as those from HACC, which can be petabytes in size. It uses large language models (LLMs) and a specialized architecture to interact with users, understand their analytical intent, and perform data retrieval and analysis without requiring full data ingestion. Key features include intelligent data subsetting, a sandboxed execution environment for code, comprehensive provenance tracking for reproducibility, and an iterative error correction mechanism. InferA cuts storage and memory requirements enough to run complex scientific analyses on standard hardware, and it completed 85% of runs in the authors' evaluation on HACC ensemble data.
Scientists are constantly pushing the boundaries of what we can understand about the universe, generating massive amounts of data from complex simulations. However, analyzing these datasets, which often reach terabytes or even petabytes, presents a significant challenge. Traditional data analysis tools struggle with the sheer volume and intricate structure of this scientific information, and many require the full dataset to be loaded into memory, which is impractical at such scales.
Introducing InferA: A New Approach to Scientific Data Analysis
To tackle these limitations, researchers have developed InferA, a multi-agent system designed to act as a smart assistant for analyzing cosmological ensemble data. InferA uses large language models (LLMs) to enable scalable and efficient scientific data analysis, particularly for datasets generated by simulations like the Hardware/Hybrid Accelerated Cosmology Code (HACC), which models the evolution of the universe.
At its core, InferA operates with a sophisticated multi-agent architecture. A central supervisor agent orchestrates a team of specialized agents, each responsible for a distinct phase of data retrieval and analysis. This collaborative approach allows the system to engage interactively with users, understanding their analytical intent and confirming query objectives to ensure alignment between user goals and the system’s actions.
Overcoming Data Challenges with Smart Design
One of InferA’s most significant contributions is its ability to interact safely with and reason over extremely large scientific datasets without requiring full data ingestion. Unlike tools such as PandasAI, which require the entire dataset to be in memory, InferA filters the data first and loads only the portions a query needs. For instance, a single time step from a HACC simulation can be around 540 GB, and a full run can have over 500 time steps, making in-memory analysis impossible. InferA reduces the required data from multiple terabytes to a few gigabytes at most, storing selected data in a DuckDB database to minimize memory and storage overhead.
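The filter-before-load idea can be sketched in a few lines. The example below is a minimal illustration, not InferA's implementation: it streams toy rows in chunks, keeps only the rows and columns a query needs, and writes them to a small local database. Python's built-in sqlite3 module stands in for DuckDB so the sketch runs without extra dependencies, and the row layout, the function names (read_chunks, build_subset, synthetic_rows), and the filter thresholds are all assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, Iterator

# Hypothetical row layout: (halo_id, time_step, mass, x, y, z).
Row = tuple[int, int, float, float, float, float]


def read_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[list[Row]]:
    """Yield fixed-size chunks so the full dataset is never held in memory."""
    chunk: list[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def build_subset(rows: Iterable[Row], time_step: int, min_mass: float,
                 chunk_size: int = 1000) -> sqlite3.Connection:
    """Store only the rows and columns the query needs in a small database."""
    conn = sqlite3.connect(":memory:")  # sqlite3 is a stand-in for DuckDB here
    conn.execute(
        "CREATE TABLE halos (halo_id INTEGER, time_step INTEGER, mass REAL)"
    )
    for chunk in read_chunks(rows, chunk_size):
        # Drop unneeded rows and the x, y, z columns before anything is stored.
        selected = [(r[0], r[1], r[2]) for r in chunk
                    if r[1] == time_step and r[2] >= min_mass]
        conn.executemany("INSERT INTO halos VALUES (?, ?, ?)", selected)
    conn.commit()
    return conn


def synthetic_rows(n: int) -> Iterator[Row]:
    """Generate toy data lazily; a real loader would stream simulation files."""
    for i in range(n):
        yield (i, i % 5, float(i), 0.0, 0.0, 0.0)


conn = build_subset(synthetic_rows(10_000), time_step=3, min_mass=9_000.0)
(count,) = conn.execute("SELECT COUNT(*) FROM halos").fetchone()
print(f"kept {count} of 10000 rows")
```

Because the rows arrive through a generator and are filtered chunk by chunk, peak memory depends on the chunk size rather than on the size of the full dataset.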
The system’s workflow is divided into two main stages: planning and analysis. In the planning stage, a dedicated agent engages in a multi-turn dialogue with the user to understand their request and generate a step-by-step analytical plan. This plan is refined through human feedback, ensuring it accurately reflects the user’s intent. Once approved, the analysis stage begins, with the supervisor agent delegating tasks to specialized agents like a data-loading agent, an SQL programming agent, a Python programming agent, and a visualization agent.
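The two-stage flow described above can be sketched as a supervisor that refuses to run an unapproved plan and then hands each step to a named specialist. This is only a schematic under stated assumptions: the Step and Supervisor classes, the agent names, and the stub agent functions are hypothetical, and the stubs return fixed toy values where a real agent would call an LLM or query a database.

```python
from dataclasses import dataclass, field
from typing import Callable

# An agent takes an instruction and the shared state, and returns new state.
Agent = Callable[[str, dict], dict]


@dataclass
class Step:
    agent: str        # name of the specialist that should handle this step
    instruction: str  # natural-language description of the task


@dataclass
class Supervisor:
    agents: dict[str, Agent]
    log: list[str] = field(default_factory=list)

    def run(self, plan: list[Step], approved: bool) -> dict:
        # The analysis stage starts only after the user approves the plan.
        if not approved:
            raise ValueError("plan must be approved before analysis starts")
        state: dict = {}
        for step in plan:
            state = self.agents[step.agent](step.instruction, state)
            self.log.append(step.agent)
        return state


# Stub agents (hypothetical); real agents would generate and run code.
def data_loader(instruction: str, state: dict) -> dict:
    return {**state, "rows": [1.0, 2.0, 3.0, 4.0]}


def sql_agent(instruction: str, state: dict) -> dict:
    return {**state, "rows": [m for m in state["rows"] if m >= 2.0]}


def python_agent(instruction: str, state: dict) -> dict:
    rows = state["rows"]
    return {**state, "mean_mass": sum(rows) / len(rows)}


def visualization_agent(instruction: str, state: dict) -> dict:
    return {**state, "figure": f"bar chart of {len(state['rows'])} halos"}


supervisor = Supervisor(agents={
    "data_loader": data_loader,
    "sql": sql_agent,
    "python": python_agent,
    "visualization": visualization_agent,
})
plan = [
    Step("data_loader", "load halo masses for the requested time step"),
    Step("sql", "keep halos with mass of at least 2.0"),
    Step("python", "compute the mean mass"),
    Step("visualization", "plot the selected halos"),
]
result = supervisor.run(plan, approved=True)
print(result["mean_mass"], result["figure"])
```

Keeping the plan as plain data makes it easy to show to the user for approval and to revise before any agent runs.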
Key Innovations and Features
InferA incorporates several innovative features to enhance its capabilities:
- RAG-enabled Metadata Extraction: Scientific datasets often have ambiguous column labels (e.g., “sod_halo_MGas500c”). InferA uses Retrieval-Augmented Generation (RAG) to extract and interpret metadata, creating dictionaries that map these labels to context-rich natural language descriptions. This allows the system to understand domain-specific terminology and retrieve relevant column names with high precision.
- Code Error Detection and Correction: Recognizing that LLMs can generate imperfect code, InferA includes an iterative loop for code testing, error checking, and quality assessment within a sandboxed environment. This ensures data integrity by preventing modifications to the original data and allows the system to refine its code until it’s valid and effective.
- Provenance Tracking: Reproducibility is crucial in scientific research. InferA maintains comprehensive records of all operations, including AI-generated code, intermediate data, and outputs. This detailed audit trail makes it straightforward for researchers to verify and reproduce analytical pathways.
- Human-in-the-Loop: While capable of automation, InferA is designed for human collaboration. Users can provide feedback at key steps, significantly improving the system’s efficiency and accuracy, especially when dealing with ambiguous queries or minor code errors.
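The error-correction loop and the provenance record can be illustrated together. The sketch below is an assumption-laden toy, not InferA's code: fake_llm is a hypothetical stand-in for a model that sees the previous error message, run_with_correction is a hypothetical helper name, and running exec on a deep copy of the data only demonstrates that the original stays untouched. It is not a real security sandbox.

```python
import copy
from typing import Any, Callable, Optional


def run_with_correction(
    generate: Callable[[Optional[str]], str],
    data: dict,
    max_attempts: int = 3,
) -> tuple[Any, list[dict]]:
    """Run generated code on a copy of the data, retrying with error feedback."""
    provenance: list[dict] = []
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(error)  # the generator sees the previous error, if any
        namespace = {"data": copy.deepcopy(data)}  # original data is never passed
        try:
            exec(code, namespace)
            result = namespace["result"]
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
            provenance.append({"attempt": attempt, "code": code, "error": error})
            continue
        provenance.append({"attempt": attempt, "code": code, "error": None})
        return result, provenance
    return None, provenance  # every attempt failed; the record is still returned


def fake_llm(previous_error: Optional[str]) -> str:
    """Hypothetical generator: buggy on the first try, fixed after feedback."""
    if previous_error is None:
        return ('data["mass"].append(0.0)\n'
                'result = sum(data["mass"]) / len(data["masses"])')
    return 'result = sum(data["mass"]) / len(data["mass"])'


original = {"mass": [1.0, 2.0, 3.0]}
result, provenance = run_with_correction(fake_llm, original)
print(result)
for record in provenance:
    print(record["attempt"], record["error"])
print(original)
```

The first attempt both mutates its copy of the data and fails on a misspelled key. The second attempt succeeds, the original dictionary is unchanged, and the provenance list holds the code and outcome of each attempt.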
Real-World Impact and Future Directions
The researchers evaluated InferA using ensemble runs from the HACC cosmology simulation, totaling 1.4 TB of data. The system completed 85% of all runs and produced valid data analysis and visualization outcomes. In a separate case study, InferA processed 32 simulation runs, a total of 11.2 TB of data, reducing the storage overhead to less than 0.35% of the original size. This allows complex analyses to be performed on standard computational hardware, work that previously required substantial high-performance computing (HPC) resources or extensive manual data reduction.
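A quick back-of-the-envelope check puts the article's figures on one scale. The short script below only restates numbers quoted in the text and assumes decimal units (1 TB = 1000 GB). Since the article says "over 500 time steps" and "less than 0.35%", the first result is a lower bound and the last an upper bound.

```python
GB_PER_TB = 1000  # assumption: decimal units

# Size of one full run, from the per-time-step figure.
step_gb = 540
steps_per_run = 500
full_run_tb = step_gb * steps_per_run / GB_PER_TB

# Case study: 32 runs, 11.2 TB in total, reduced to under 0.35%.
case_study_tb = 11.2
runs = 32
reduction_fraction = 0.0035
avg_run_gb = case_study_tb * GB_PER_TB / runs
reduced_gb = case_study_tb * GB_PER_TB * reduction_fraction

print(f"one full run: at least {full_run_tb:.0f} TB")
print(f"case study average per run: {avg_run_gb:.0f} GB")
print(f"case study after reduction: under {reduced_gb:.1f} GB")
```

Under these assumptions a single full run is at least 270 TB, while the 11.2 TB case study shrinks to under roughly 39 GB, a size that fits on an ordinary workstation disk.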
InferA represents a significant leap forward in making large-scale scientific data analysis more accessible and efficient. Its modular design also allows for straightforward adaptation to other scientific domains by implementing domain-specific data loaders and metadata dictionaries. Future work aims to integrate a web agent to review online data sources, combining HACC simulation data with information from publications, and to investigate parallelized workflow execution to further reduce runtime.
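To make the idea of a metadata dictionary concrete, the sketch below maps column labels to plain-language descriptions and ranks them against a user query. It is a deliberately simplified stand-in: plain keyword overlap replaces the retrieval step of a real RAG pipeline, the descriptions are illustrative guesses rather than official HACC definitions, and every label except sod_halo_MGas500c, along with the helper names tokens and find_columns, is hypothetical.

```python
import re

# Illustrative metadata dictionary; descriptions are assumptions, not official.
METADATA = {
    "sod_halo_MGas500c": ("gas mass of the halo inside the radius enclosing "
                          "500 times the critical density"),
    "sod_halo_radius": "radius of the spherical overdensity halo",
    "fof_halo_count": "number of particles in the friends-of-friends halo",
}

STOPWORDS = {"the", "of", "in", "a", "each", "for"}


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def find_columns(query: str, metadata: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank columns by how many query words appear in their description."""
    query_tokens = tokens(query)
    scored = [(len(query_tokens & tokens(description)), name)
              for name, description in metadata.items()]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [name for score, name in ranked[:top_k] if score > 0]


print(find_columns("total gas mass in each halo", METADATA))
print(find_columns("stellar luminosity", METADATA))
```

Adapting such a system to another domain would then mostly mean supplying a different dictionary and a loader for that domain's file format, while the lookup logic stays the same.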
More detail is available in the full research paper.


