Enhancing Visual Question Answering with QA-Dragon's Adaptive Retrieval

TLDR: QA-Dragon is a new system designed to improve Visual Question Answering (VQA) by dynamically integrating external knowledge. It uses a query-aware approach with specialized routers to decide whether to search for text or images, and how to combine this information. Tested on the Meta CRAG-MM Challenge, QA-Dragon significantly boosts accuracy and knowledge overlap in complex VQA tasks, including multi-hop and multi-turn questions, by effectively mitigating issues like hallucinations and limited reasoning in Multimodal Large Language Models.

Multimodal Large Language Models (MLLMs) can understand and reason over both visual and linguistic information, which makes them a natural fit for tasks like Visual Question Answering (VQA). However, these models often struggle with complex queries that demand deep, up-to-date knowledge or multiple steps of reasoning. The result can be inaccurate or fabricated answers, a phenomenon known as hallucination.

To tackle these limitations, researchers Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, and Qing Li from The Hong Kong Polytechnic University have introduced a system called QA-Dragon. This Query-Aware Dynamic Retrieval-Augmented Generation (RAG) System is designed for knowledge-intensive VQA and aims to provide more accurate and reliable answers by selectively incorporating external knowledge.

What is QA-Dragon?

At its core, QA-Dragon is a sophisticated framework that moves beyond traditional RAG methods, which typically retrieve information from text or images in isolation. Instead, QA-Dragon orchestrates both text and image search agents in a hybrid setup, enabling it to handle complex VQA tasks that involve multimodal, multi-turn, and multi-hop reasoning.

How Does It Work?

QA-Dragon is built from several modular components that run in sequence:

Domain Router: This component acts as an initial guide, identifying the subject domain of a query (e.g., ‘Vehicles’, ‘Food’, ‘Books’). This allows the system to apply domain-specific reasoning strategies, ensuring more relevant and accurate processing.
Pre-Answer Module (D-CoT): Before any external search, this module uses a Domain-specific Chain-of-Thought (D-CoT) process. It prompts the MLLM to generate a preliminary answer and a reasoning trace, helping the system understand what it confidently knows and where it might need more information.
Search Router: Based on the pre-answer and reasoning trace, this crucial component decides the most suitable execution path. It can opt for a ‘Direct Output’ if the answer is clear from the image, ‘Search Verify’ if the answer needs external factual verification, or ‘RAG’ (Retrieval-Augmented Generation) if the query requires synthesizing new information from external knowledge.
Tool Router: If external retrieval is needed, the Tool Router steps in to select the optimal modality for the search. It determines whether to use an ‘Image Search Agent’ (for identifying unknown objects) or a ‘Text Search Agent’ (for factual attributes not visible in the image), or even both.
Image Search Agent: This agent focuses on visual entities. It uses techniques like Multimodal Object Extraction and Segmentation to identify and crop specific objects from an image, then performs searches for visually similar items to infer object identities or gather related visual information.
Text Search Agent: For textual evidence, this agent employs Query Rephrasing to break down complex questions into clearer sub-queries. It also uses ‘Fusion Search’ to combine image-derived object information with the original query, creating more precise text search queries.
Coarse-to-fine Multimodal Reranker: After retrieval, this two-stage reranker filters and prioritizes the most relevant information from both image and text sources, ensuring that only high-quality evidence is used to augment answers.
Post-Answer Module: Finally, this module consolidates the retrieved evidence with the initial hypothesis to generate a final, verifiable response. It includes a CoT-based Answer Generation and a dual-verification mechanism to reduce hallucinations and ensure factual precision.
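
The routing logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: in QA-Dragon the routers act on the MLLM's pre-answer and reasoning trace, whereas here those judgments are replaced by hypothetical boolean flags. The names PreAnswer, search_router, tool_router, and plan, along with every field name, are assumptions made for this example, as is the fallback to text search when no tool is otherwise selected.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    DIRECT_OUTPUT = "direct_output"   # answer is clear from the image
    SEARCH_VERIFY = "search_verify"   # answer needs external fact-checking
    RAG = "rag"                       # answer must be built from retrieved knowledge


class Tool(Enum):
    IMAGE_SEARCH = "image_search"     # identify unknown objects
    TEXT_SEARCH = "text_search"       # fetch facts not visible in the image


@dataclass
class PreAnswer:
    """Hypothetical summary of the pre-answer step; all fields are assumptions."""
    answer: str
    answerable_from_image: bool   # stand-in for the model's own confidence
    needs_fact_check: bool        # the draft answer asserts a checkable fact
    entity_identified: bool       # the main object in the image was recognized
    needs_external_facts: bool    # the question asks about non-visible attributes


def search_router(pre: PreAnswer) -> Path:
    """Pick one of the three execution paths named in the article."""
    if pre.answerable_from_image and not pre.needs_fact_check:
        return Path.DIRECT_OUTPUT
    if pre.answerable_from_image:
        return Path.SEARCH_VERIFY
    return Path.RAG


def tool_router(pre: PreAnswer) -> list:
    """Pick image search, text search, or both."""
    tools = []
    if not pre.entity_identified:
        tools.append(Tool.IMAGE_SEARCH)
    if pre.needs_external_facts:
        tools.append(Tool.TEXT_SEARCH)
    # Assumption: fall back to text search if retrieval is needed
    # but neither flag selected a tool.
    return tools or [Tool.TEXT_SEARCH]


def plan(pre: PreAnswer):
    """Return the chosen path and the tools to call (none for direct output)."""
    path = search_router(pre)
    tools = [] if path is Path.DIRECT_OUTPUT else tool_router(pre)
    return path, tools


if __name__ == "__main__":
    draft = PreAnswer("unknown", False, False, False, True)
    print(plan(draft))
```

In this sketch a query whose object is unrecognized and whose answer depends on non-visible facts is sent down the RAG path with both search agents, while a query answerable from the image alone skips retrieval entirely.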

Performance and Impact

QA-Dragon was evaluated on the Meta CRAG-MM Challenge at KDD Cup 2025, a benchmark for multimodal, multi-turn question answering. It improved the reasoning performance of base models on both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on single-source tasks, 6.35% on multi-source tasks, and 5.03% on multi-turn tasks.

Ablation studies further confirmed the importance of each component, showing that removing any part, such as the domain router or query splitting, led to a noticeable drop in performance. This highlights the framework’s robust and well-integrated design.

The research paper, available at https://arxiv.org/pdf/2508.05197, presents QA-Dragon as a step forward for real-world VQA scenarios and a promising route to more grounded and trustworthy AI systems.
