TLDR: The GMAT framework introduces a multi-agent system to generate detailed, clinically accurate descriptions from pathology textbooks. These descriptions, used as a list rather than a single prompt, improve Vision-Language Models (VLMs) for whole slide image classification, leading to better performance in cancer diagnosis on renal and lung cancer datasets.
In the evolving field of digital pathology, accurately classifying whole slide images (WSIs) is crucial for cancer diagnosis. These images are enormous, often gigapixels in size, and contain complex tissue patterns, making their analysis a significant challenge. Traditional methods often struggle with the sheer scale and intricate detail present in these images.
Multiple Instance Learning (MIL) has emerged as a leading approach to tackle WSI classification. MIL treats a WSI as a ‘bag’ of smaller image patches, with the diagnostic label applied to the entire slide rather than individual patches. More recently, Vision-Language Models (VLMs) have been integrated into MIL pipelines. These models aim to combine visual information from images with textual medical knowledge, using text-based descriptions of diseases instead of just simple class names. This integration helps in incorporating rich medical context into the classification process.
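The bag formulation can be illustrated with a toy sketch. This is not the paper's model: the patch scorer below is a stand-in function (a real MIL model learns it), and max pooling is just the classic aggregation rule, where a slide is positive if any patch carries diagnostic evidence:

```python
import numpy as np

def patch_scores(bag: np.ndarray) -> np.ndarray:
    """Stand-in for a learned patch-level scorer; returns one
    'tumor evidence' score per patch (here: mean feature value)."""
    return bag.mean(axis=1)

def mil_slide_label(bag: np.ndarray, threshold: float = 0.5) -> int:
    """Classic MIL rule: the slide is positive if ANY patch scores
    above the threshold -- the label belongs to the bag, not the patches."""
    return int(patch_scores(bag).max() > threshold)

# A 'slide' as a bag of 4 patches, each with 3 toy features.
negative_bag = np.full((4, 3), 0.2)
positive_bag = negative_bag.copy()
positive_bag[2] = 0.9  # a single diagnostic patch flips the slide label

print(mil_slide_label(negative_bag))  # 0
print(mil_slide_label(positive_bag))  # 1
```

Attention-based MIL variants (like those discussed later) replace the hard max with a learned weighted average over patches.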
However, current VLM-based methods face limitations. When large language models (LLMs) are used to generate clinical descriptions, or when fixed-length prompts represent complex pathology concepts, the limited token capacity of VLMs can restrict the depth and richness of the encoded medical information. Furthermore, descriptions generated solely by LLMs might lack the specific domain grounding and fine-grained medical detail required for accurate pathology, potentially leading to a mismatch with visual features.
Introducing GMAT: A Grounded Multi-Agent Approach
To overcome these challenges, researchers have proposed a novel vision-language MIL framework called GMAT. This framework introduces two key innovations: first, a grounded multi-agent description generation system, and second, a text encoding strategy that uses a list of descriptions rather than a single prompt.
The core of GMAT is its multi-agent system, GMATG (Grounded Multi-Agent Text Generation). This system leverages carefully selected pathology textbooks as a structured knowledge base. Instead of a single entity generating descriptions, GMATG employs a team of specialized agents, each with a distinct role:
- Planning Agent: This agent creates a detailed guide for describing a specific cancer type, outlining the structure, rules for analyzing cell and tissue features, necessary clinical information, and quality standards. It produces a markdown plan to guide the other agents.
- Generate Agent: Using the plan from the Planning Agent and the shared knowledge base, this agent composes an initial draft of the class description.
- Verify Agent: This agent reviews the generated description for medical accuracy, completeness, and consistent terminology based on pathology standards. It provides a corrected version along with a quality report and recommendations.
- Finalize Agent: The final agent converts the approved description into a structured JSON file. This file contains a list of short clinical sentences, ordered from general to microscopic, molecular, and clinical details, ensuring proper formatting and concise language.
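The paper does not publish the agents' implementation, but the four roles can be viewed as a chain of calls, one feeding the next. The sketch below uses placeholder functions (`plan`, `generate`, `verify`, `finalize` are illustrative names, not the authors' API, and each stands in for a grounded LLM call); the finalize step emits the list-of-sentences JSON that the text encoder later consumes:

```python
import json

def plan(cancer_type: str) -> str:
    """Planning Agent: markdown guide covering structure and rules."""
    return f"# Description plan for {cancer_type}\n- general\n- microscopic\n- molecular\n- clinical"

def generate(plan_md: str, knowledge_base: list) -> str:
    """Generate Agent: drafts a description from the plan + textbook sources."""
    return " ".join(knowledge_base)

def verify(draft: str) -> str:
    """Verify Agent: checks accuracy and terminology, returns corrected text."""
    return draft.strip()

def finalize(approved: str) -> str:
    """Finalize Agent: JSON list of short sentences, general -> clinical."""
    sentences = [s.strip() + "." for s in approved.split(".") if s.strip()]
    return json.dumps({"descriptions": sentences}, indent=2)

# Toy knowledge-base snippets (illustrative, not quoted from the paper).
kb = ["Clear cell RCC shows nests of cells with clear cytoplasm.",
      "VHL inactivation is characteristic."]
out = finalize(verify(generate(plan("clear cell RCC"), kb)))
print(out)
```

The key design point survives the stubbing: each agent consumes the previous agent's structured output, so errors can be caught at the verify stage before the description is frozen into JSON.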
This collaborative, multi-agent workflow ensures that the generated class descriptions are clinically grounded, semantically rich, and highly accurate, providing a robust foundation for downstream vision-language MIL classification.
How GMAT Integrates Vision and Language
The GMAT model integrates GMATG’s descriptions into a vision-language MIL pipeline. It uses CONCH, a vision-language foundation model pretrained on pathology image–caption pairs, as the shared encoder for both image patches and text descriptions. Whole slide images are divided into patches at different magnifications (e.g., 5× and 10×), which are then processed by the CONCH visual encoder and mapped into a shared embedding space. For the text component, the multiple class-specific descriptions generated by GMATG are tokenized and encoded using the frozen CONCH text encoder.
To align the visual and textual information, GMAT computes the similarity between each image patch embedding and all description embeddings. These similarity scores are then aggregated to produce class-level scores. An attention-based mechanism, adapted from the CLAM model, further weights and combines these patch-level scores to generate a slide-level prediction.
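This scoring step can be sketched numerically. The code below uses random stand-ins for the CONCH embeddings, a mean over each class's description list, and a softmax over random gating scores in place of CLAM's learned attention network; the shapes and the flow (patch–description similarity → class scores → attention-pooled slide prediction) are what matter:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_classes, n_desc, dim = 6, 2, 4, 32

def l2norm(x):
    """Normalize embeddings so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for frozen CONCH outputs in a shared embedding space.
patches = l2norm(rng.normal(size=(n_patches, dim)))
descriptions = l2norm(rng.normal(size=(n_classes, n_desc, dim)))

# Cosine similarity of every patch against every description,
# then aggregate over each class's description list (mean here).
sim = np.einsum("pd,ckd->pck", patches, descriptions)  # (patch, class, desc)
patch_class_scores = sim.mean(axis=2)                  # (patch, class)

# Attention-weighted pooling over patches. CLAM learns these weights
# from the patch features; a random softmax stands in here.
attn = np.exp(rng.normal(size=n_patches))
attn /= attn.sum()
slide_logits = attn @ patch_class_scores               # (class,)

prediction = int(slide_logits.argmax())
print(slide_logits, prediction)
```

Because each class is scored against a list of descriptions rather than one prompt, a patch only needs to match some of the class's clinical sentences to contribute evidence, which is the intuition behind GMAT's list-based text encoding.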
Experimental Validation and Impact
The effectiveness of GMAT was evaluated on two significant cancer subtyping datasets: TCGA-RCC (Renal Cell Carcinoma) and TCGA-Lung (Lung Adenocarcinoma and Lung Squamous Cell Carcinoma). The researchers compared GMAT’s performance against existing methods in both zero-shot (without specific training on the dataset) and fine-tuned settings.
In the zero-shot setting, using a list of descriptions generated by GMATG consistently improved performance over a single class description, particularly on the TCGA-Lung dataset. This indicates that GMATG provides more informative and discriminative prompts even without extensive training.
When fine-tuned, GMAT achieved performance comparable to, and in some cases, slightly better than state-of-the-art models like ViLa-MIL. On the TCGA-Lung dataset, GMAT showed better results across all metrics, demonstrating the value of its structured, multi-agent descriptions during the training process.
An ablation study further confirmed the benefit of the multi-agent design, showing that the collaborative system performed slightly better than a simpler single-agent version. This highlights that the structured planning, generation, and review process by specialized agents leads to more accurate and comprehensive descriptions.
In conclusion, GMAT represents a significant advancement in vision-language MIL for whole slide image classification. By generating clinically grounded, list-based prompts through a sophisticated multi-agent system, GMAT effectively captures diverse and structured medical descriptions. This approach not only improves performance in both zero-shot and fine-tuned settings but also enhances the interpretability of computational pathology models, paving the way for more accurate and reliable cancer diagnoses. You can read the full research paper here: GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification.


