TLDR: OpenDerisk is an open-source, AI-driven multi-agent framework designed to automate and augment Site Reliability Engineering (SRE) tasks. It addresses the complexity of modern software by using specialized AI agents that collaborate, reason, and leverage a sophisticated knowledge base. Deployed at Ant Group, it has proven effective in improving diagnostic accuracy and efficiency for over 3,000 daily users across various scenarios, acting as a powerful “co-pilot” for SRE teams.
Modern software systems, with their intricate web of distributed microservices and rapid release cycles, have become incredibly complex. This complexity places an immense burden on Site Reliability Engineering (SRE) teams, who are tasked with ensuring these systems run smoothly. Traditionally, SRE involves deep investigative work, like Root Cause Analysis (RCA), which requires synthesizing vast amounts of data and applying expert knowledge. Existing AI solutions often fall short, either lacking the deep causal reasoning needed or not being specifically designed for SRE’s unique diagnostic workflows.
To address this gap, researchers have introduced OpenDerisk, a specialized, open-source multi-agent framework tailored for SRE. The framework aims to augment, rather than simply automate, human expertise by emulating the investigative sense-making of an expert SRE. OpenDerisk integrates four core components: a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized Model Context Protocol (MCP) for tool access.
How OpenDerisk Works
At its heart, OpenDerisk operates on a multi-agent ReAct paradigm, which allows different specialized AI agents to collaborate and solve complex, multi-domain problems. The system’s workflow begins with a Perception Layer that ingests diverse signals like log alarms, anomalous application behavior, and environment changes. This data then feeds into the DeRisk System, the framework’s central nervous system.
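To make the ReAct idea concrete, here is a minimal sketch of a single agent's thought, action, observation loop. It is an illustration only, not OpenDerisk's code: the names `react_loop`, `toy_policy`, and `query_logs` are hypothetical, and a hand-written policy function stands in for the LLM that would normally choose each step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str       # why the agent chose this action
    action: str        # name of the tool that was called
    observation: str   # what the tool returned


Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def react_loop(signal: str, policy: Policy,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Alternate reasoning and tool calls until the policy says 'finish'."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(signal, history)
        if action == "finish":
            return arg, history          # arg carries the conclusion
        observation = tools[action](arg)
        history.append(Step(thought, action, observation))
    return "inconclusive", history       # step budget exhausted


# Hypothetical stand-in for an LLM: look at logs once, then conclude.
def toy_policy(signal: str, history: List[Step]) -> Tuple[str, str, str]:
    if not history:
        return ("Check recent logs for the alarming service", "query_logs", signal)
    if "OOM" in history[-1].observation:
        return ("Logs point to memory exhaustion", "finish",
                "out-of-memory in " + signal)
    return ("No clear evidence in the logs", "finish", "inconclusive")


# Hypothetical tool with a canned response.
TOOLS = {"query_logs": lambda service: f"{service}: OOMKilled at 12:03"}

conclusion, history = react_loop("payment-service", toy_policy, TOOLS)
print(conclusion)
```

In a multi-agent setting, each specialized agent would run a loop of this shape over its own tool set, with an orchestrator deciding which agent handles which part of the problem.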
Within the DeRisk System, a Multi-Agent System orchestrates a team of specialized agents, such as an OS-Agent or a Code-Agent, dynamically adapting its collaboration strategy based on the scenario. Each agent is equipped with a pluggable Reasoning Engine that supports various modes, from exploratory LLM ReAct Mode to deterministic Standard Operating Procedure (SOP) Mode. A powerful Knowledge Engine (K-Engine) grounds the agents’ analysis in domain-specific data. This K-Engine uses a five-stage pipeline to transform raw enterprise data into an actionable knowledge base, involving data parsing, intelligent chunking, semantic enrichment, hybrid indexing (including vector and knowledge graph indexes), and continuous active learning and updates. Agents interact with the live environment through a standardized set of tools, governed by the Model Context Protocol (MCP), ensuring extensibility.
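The pluggable reasoning engine can be pictured as a strategy object that an agent delegates to. The sketch below is an assumption-laden illustration rather than the framework's actual API: `ReasoningEngine`, `SopEngine`, `ExploratoryEngine`, `Agent`, and the two tools are all hypothetical names, and a plain function replaces the LLM in the exploratory mode.

```python
from typing import Callable, Dict, List, Optional, Protocol

Tool = Callable[[str], str]


class ReasoningEngine(Protocol):
    """Interface every reasoning mode implements (hypothetical)."""
    def diagnose(self, alert: str, tools: Dict[str, Tool]) -> List[str]: ...


class SopEngine:
    """Deterministic mode: run a fixed, ordered checklist of tool calls."""
    def __init__(self, steps: List[str]) -> None:
        self.steps = steps

    def diagnose(self, alert: str, tools: Dict[str, Tool]) -> List[str]:
        return [f"{name}: {tools[name](alert)}" for name in self.steps]


class ExploratoryEngine:
    """Exploratory mode: a chooser (standing in for an LLM) picks each tool."""
    def __init__(self,
                 choose_next: Callable[[str, List[str]], Optional[str]],
                 max_steps: int = 3) -> None:
        self.choose_next = choose_next
        self.max_steps = max_steps

    def diagnose(self, alert: str, tools: Dict[str, Tool]) -> List[str]:
        findings: List[str] = []
        for _ in range(self.max_steps):
            name = self.choose_next(alert, findings)
            if name is None:             # chooser decides it has seen enough
                break
            findings.append(f"{name}: {tools[name](alert)}")
        return findings


class Agent:
    """An agent owns its tools and delegates reasoning to a swappable engine."""
    def __init__(self, name: str, engine: ReasoningEngine,
                 tools: Dict[str, Tool]) -> None:
        self.name = name
        self.engine = engine
        self.tools = tools

    def handle(self, alert: str) -> List[str]:
        return self.engine.diagnose(alert, self.tools)


# Hypothetical tools with canned responses.
tools: Dict[str, Tool] = {
    "check_cpu": lambda alert: "cpu at 35%",
    "check_disk": lambda alert: "disk at 98%",
}

os_agent = Agent("OS-Agent", SopEngine(["check_cpu", "check_disk"]), tools)
sop_findings = os_agent.handle("host-42 latency spike")


def choose_next(alert: str, findings: List[str]) -> Optional[str]:
    # Toy heuristic: inspect the disk once, then stop.
    return "check_disk" if not findings else None


os_agent.engine = ExploratoryEngine(choose_next)   # swap modes at runtime
exploratory_findings = os_agent.handle("host-42 latency spike")
```

Because the agent only depends on the `diagnose` interface, the same agent and tool set can run a strict checklist for well-understood incidents and a more open-ended search for novel ones.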
Finally, the Analysis and Reporting Layer synthesizes the findings into human-readable outputs like Diagnostic Reports and Root Cause Locations. A crucial aspect of OpenDerisk is its Human-in-the-Loop (HITL) feedback mechanism, allowing SREs to provide guidance and corrections, which helps the system learn and improve continuously.
Real-World Validation and Impact
OpenDerisk isn’t just a theoretical concept; it has been successfully deployed in production at Ant Group. This large-scale deployment serves over 3,000 daily users and executes more than 60,000 diagnostic runs per day. In just three months, it was adopted for 13 new application scenarios, with developers creating over 50 new specialized agents, demonstrating its industrial-grade scalability and practical impact. The framework’s effectiveness has been validated through comprehensive evaluations, showing significant improvements in accuracy and efficiency compared to traditional monolithic agent designs.
The research paper reports that OpenDerisk’s multi-specialist agent framework consistently outperforms simpler agent architectures in task accuracy, though it sometimes incurs a slightly longer execution time due to its increased complexity. It also highlights the framework’s adaptability: it can integrate new, domain-specific knowledge and work with different underlying Large Language Models (LLMs). An ablation study further confirmed that the collaborative specialization of its multi-agent architecture is critical for achieving robust, enterprise-grade diagnostics.
While OpenDerisk currently functions as an assistive ‘co-pilot’ requiring human oversight, its future roadmap aims to evolve it into a fully autonomous ‘pilot’ capable of executing safe, closed-loop remediation. This will involve advanced reinforcement learning to optimize agent tool-use and system-level collaboration. For more in-depth information, you can refer to the original research paper: OpenDerisk: An Industrial Framework for AI-Driven SRE.


