TLDR: A new voice-directed AI platform, SAOP, uses a hierarchical multi-agent system powered by Large Language Models to allow surgeons to access and manipulate patient data (clinical info, CT scans, 3D models) during da Vinci robotic surgery without interrupting the procedure. It demonstrates high accuracy and robustness to speech errors and varied commands, showing significant potential for improving surgical efficiency by enabling hands-free interaction with critical patient information.
In da Vinci robotic surgery, the surgeon's hands and eyes are fully committed to the procedure, which makes it difficult to access and interact with patient data without interrupting the operation. Doing so often means shifting attention away from the surgical console to external monitors or auxiliary interfaces, a particular hindrance when working with multimodal data such as clinical information, CT scans, MRI images, and 3D anatomical models.
To address this, researchers have developed a voice-directed platform called the Surgical Agent Orchestrator Platform (SAOP). The system is built on a hierarchical multi-agent framework and uses Large Language Models (LLMs) to act as an assistant for surgeons.
The SAOP consists of a central orchestration agent and three specialized task-specific agents. These LLM-driven agents are designed to autonomously plan, refine, validate, and reason, effectively translating a surgeon’s voice commands into specific actions. For instance, a surgeon can simply speak a command to retrieve clinical information, manipulate CT scans, or navigate 3D anatomical models directly on the surgical video feed.
The workflow orchestrator agent acts as the brain of the system, making probabilistic decisions about which function to execute next. It manages a series of workflow functions, including real-time audio capture, speech-to-text transcription, command correction and validation using LLMs, and command reasoning to route the request to the appropriate task agent.
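The sequence of workflow functions described above can be pictured as a small pipeline. The following is a minimal sketch and not the paper's implementation: it assumes a transcript is already available, replaces the LLM-based correction, validation, and reasoning steps with simple stand-ins, and runs the stages in a fixed order rather than choosing them probabilistically. The keyword table and every function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical keyword table; the real system reasons with an LLM instead.
AGENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "IR": ("age", "pft", "history"),
    "IV": ("ct", "slice", "axial", "coronal", "sagittal"),
    "AR": ("3d", "model", "rotate", "anterior", "posterior"),
}


@dataclass
class WorkflowResult:
    transcript: str          # raw speech-to-text output
    command: Optional[str]   # corrected command, or None if rejected
    agent: Optional[str]     # "IR", "IV", "AR", or None if rejected


def correct(transcript: str) -> str:
    """Stand-in for LLM command correction: normalise case and whitespace."""
    return " ".join(transcript.lower().split())


def route(command: str) -> Optional[str]:
    """Stand-in for command reasoning: pick the agent with most keyword hits."""
    words = command.split()
    scores = {
        agent: sum(word in keywords for word in words)
        for agent, keywords in AGENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def validate(command: str) -> bool:
    """Stand-in for LLM validation: accept only commands an agent can take."""
    return route(command) is not None


def run_workflow(transcript: str) -> WorkflowResult:
    """Run correction, validation, and routing on one transcript."""
    command = correct(transcript)
    if not validate(command):
        return WorkflowResult(transcript, None, None)
    return WorkflowResult(transcript, command, route(command))
```

In this sketch, `run_workflow("move to coronal slice 40")` is routed to the `"IV"` label, while a transcript with no recognised keyword is rejected and would be a natural point to ask the surgeon to restate the command.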
The three task-specific agents are:
Information Retrieval (IR) Agent
This agent is responsible for retrieving and overlaying relevant clinical information onto the surgical video. Surgeons can ask for patient age, PFT (Pulmonary Function Test) results, or other specific data, and the IR agent will display it in a clear, formatted manner.
Image Viewer (IV) Agent
The IV agent allows surgeons to interact with CT DICOM images. They can scroll through axial, coronal, and sagittal planes, zoom in or out, and move to specific slices using voice commands. This provides dynamic access to detailed anatomical views during surgery.
Anatomy Rendering (AR) Agent
This agent renders 3D anatomical models, reconstructed from CT images, directly on the surgical video. Surgeons can select specific structures, change viewpoints (e.g., anterior, posterior, surgical view), rotate the model, and zoom in on particular areas of interest, offering an interactive visualization of complex anatomy.
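To make the idea of a task agent's actions more concrete, here is a minimal sketch of the kind of state an image viewer such as the IV agent might track once a command has been understood. It is an illustration under assumptions, not the paper's design: the class name, the slice counts, and the zoom limits are all hypothetical, and no DICOM files are read.

```python
from dataclasses import dataclass
from typing import Dict

PLANES = ("axial", "coronal", "sagittal")


@dataclass
class ViewerState:
    """Hypothetical viewer state: current plane, slice index, and zoom."""

    slice_counts: Dict[str, int]  # number of slices per plane (assumed known)
    plane: str = "axial"
    index: int = 0
    zoom: float = 1.0

    def _clamp(self, index: int) -> int:
        # Keep the slice index inside the valid range for the current plane.
        return max(0, min(index, self.slice_counts[self.plane] - 1))

    def set_plane(self, plane: str) -> None:
        if plane not in PLANES:
            raise ValueError(f"unknown plane: {plane}")
        self.plane = plane
        self.index = self._clamp(self.index)

    def scroll(self, delta: int) -> None:
        # Positive delta moves forward through the stack, negative moves back.
        self.index = self._clamp(self.index + delta)

    def go_to(self, index: int) -> None:
        self.index = self._clamp(index)

    def zoom_by(self, factor: float) -> None:
        # Zoom limits of 1x to 8x are arbitrary choices for this sketch.
        self.zoom = min(8.0, max(1.0, self.zoom * factor))
```

Clamping in one place means that an out-of-range request (for example, asking for slice 1000 in a 300-slice stack) lands on the last valid slice instead of failing mid-procedure. Whether SAOP behaves this way is not stated in the article.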
A key strength of SAOP lies in its ability to handle free-form voice commands, moving beyond systems that rely on predefined phrases. The LLM-based agents enhance robustness against common challenges like speech recognition errors and ambiguous phrasing, as they can correct and reason about the surgeon’s intent.
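As a rough illustration of what correcting a speech recognition error involves, the sketch below repairs misheard words by matching them against a small command vocabulary. Note the substitution: SAOP uses LLMs for correction and reasoning, whereas this example swaps in plain fuzzy string matching from Python's standard `difflib` module. The vocabulary and the similarity cutoff are assumptions chosen for the example.

```python
import difflib
from typing import List

# Hypothetical vocabulary of terms a viewer command might contain.
VOCABULARY: List[str] = [
    "axial", "coronal", "sagittal",
    "anterior", "posterior",
    "zoom", "rotate", "slice",
]


def correct_transcript(
    transcript: str,
    vocabulary: List[str] = VOCABULARY,
    cutoff: float = 0.8,
) -> str:
    """Replace near-miss words with their closest vocabulary entry.

    Words that already match, or that are not close to any entry,
    are left unchanged.
    """
    corrected = []
    for word in transcript.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)
```

With this sketch, `correct_transcript("show coronel view")` yields `"show coronal view"`. A word-level matcher like this cannot resolve ambiguous phrasing or infer intent, which is the gap the LLM-based agents are meant to cover.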
To evaluate the platform's performance, the researchers introduced a Multi-level Orchestration Evaluation Metric (MOEM). Tested with 240 voice commands, the SAOP demonstrated high accuracy and success rates. It showed strong error recovery, particularly from initial speech-to-text transcription errors, which the system often corrected in subsequent stages. The workflow-level success rate reached 95.8% under multi-pass conditions, in which the system recovers from an invalid command by prompting the user to restate it more clearly.
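The multi-pass idea, where an invalid command leads to a prompt to restate it, can be sketched as a short retry loop together with a success-rate helper. This is a minimal sketch under assumptions: the article does not say how many passes are allowed, so `max_passes` is a hypothetical parameter, and `is_valid` stands in for the LLM-based validation step.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_multi_pass(
    attempts: Iterable[str],
    is_valid: Callable[[str], bool],
    max_passes: int = 3,
) -> Tuple[Optional[str], int]:
    """Return the first valid utterance and the number of passes used.

    Each item in `attempts` represents the user restating the command
    after being prompted. Returns (None, passes) if no attempt is valid.
    """
    passes = 0
    for utterance in attempts:
        if passes >= max_passes:
            break
        passes += 1
        if is_valid(utterance):
            return utterance, passes
    return None, passes


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of commands that eventually succeeded."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)
```

A workflow-level rate under multi-pass conditions would then count a command as successful if any allowed pass succeeds, while a single-pass rate would only count the first attempt.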
The platform also proved robust to variations in speaker voices, including synthesized neural voices and human speech, maintaining stable performance. Furthermore, in a real-world simulation, the wake-word detection for “davinci” showed a false alarm rate of zero over one hour of testing, preventing unintended activations in a busy operating room.
While the SAOP represents a significant step forward, the researchers acknowledge areas for future improvement. These include fine-tuning speech-to-text models for medical terminology, enhancing the handling of complex multi-step composite commands, and developing self-evolving orchestration mechanisms to adapt to different LLMs and surgeon-specific patterns. The current work, detailed in the paper available at arXiv:2511.07392, establishes a flexible and scalable platform with strong potential to enhance minimally invasive da Vinci robotic surgery by providing seamless, voice-directed access to critical patient data.


