Enhancing Visual Question Answering with QA-Dragon’s Adaptive Retrieval

TLDR: QA-Dragon is a new system designed to improve Visual Question Answering (VQA) by dynamically integrating external knowledge. It uses a query-aware approach with specialized routers to decide whether to search for text or images, and how to combine this information. Tested on the Meta CRAG-MM Challenge, QA-Dragon significantly boosts accuracy and knowledge overlap in complex VQA tasks, including multi-hop and multi-turn questions, by effectively mitigating issues like hallucinations and limited reasoning in Multimodal Large Language Models.

In the rapidly evolving field of Artificial Intelligence, Multimodal Large Language Models (MLLMs) have shown remarkable abilities in understanding and reasoning across both visual and linguistic information, particularly in tasks like Visual Question Answering (VQA). However, these advanced models often struggle with complex queries that demand deep, up-to-date knowledge or require multiple steps of reasoning, sometimes leading to inaccurate or fabricated answers, a phenomenon known as hallucination.

To tackle these limitations, researchers Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, and Qing Li from The Hong Kong Polytechnic University have introduced QA-Dragon, a Query-Aware Dynamic Retrieval-Augmented Generation (RAG) system. It is designed for knowledge-intensive VQA and aims to provide more accurate and reliable answers by deciding, per query, whether and how to incorporate external knowledge.

What is QA-Dragon?

At its core, QA-Dragon is a sophisticated framework that moves beyond traditional RAG methods, which typically retrieve information from text or images in isolation. Instead, QA-Dragon orchestrates both text and image search agents in a hybrid setup, enabling it to handle complex VQA tasks that involve multimodal, multi-turn, and multi-hop reasoning.

How Does It Work?

QA-Dragon’s innovative approach is built upon several key modular components that work in harmony:

  • Domain Router: This component acts as an initial guide, identifying the subject domain of a query (e.g., ‘Vehicles’, ‘Food’, ‘Books’). This allows the system to apply domain-specific reasoning strategies, ensuring more relevant and accurate processing.
  • Pre-Answer Module (D-CoT): Before any external search, this module uses a Domain-specific Chain-of-Thought (D-CoT) process. It prompts the MLLM to generate a preliminary answer and a reasoning trace, helping the system understand what it confidently knows and where it might need more information.
  • Search Router: Based on the pre-answer and reasoning trace, this crucial component decides the most suitable execution path. It can opt for a ‘Direct Output’ if the answer is clear from the image, ‘Search Verify’ if the answer needs external factual verification, or ‘RAG’ (Retrieval-Augmented Generation) if the query requires synthesizing new information from external knowledge.
  • Tool Router: If external retrieval is needed, the Tool Router selects the search modality: an ‘Image Search Agent’ (for identifying unknown objects), a ‘Text Search Agent’ (for factual attributes not visible in the image), or both.
  • Image Search Agent: This agent focuses on visual entities. It uses techniques like Multimodal Object Extraction and Segmentation to identify and crop specific objects from an image, then performs searches for visually similar items to infer object identities or gather related visual information.
  • Text Search Agent: For textual evidence, this agent employs Query Rephrasing to break down complex questions into clearer sub-queries. It also uses ‘Fusion Search’ to combine image-derived object information with the original query, creating more precise text search queries.
  • Coarse-to-fine Multimodal Reranker: After retrieval, this two-stage reranker filters and prioritizes the most relevant information from both image and text sources, ensuring that only high-quality evidence is used to augment answers.
  • Post-Answer Module: Finally, this module consolidates the retrieved evidence with the initial hypothesis to generate a final, verifiable response. It includes a CoT-based Answer Generation and a dual-verification mechanism to reduce hallucinations and ensure factual precision.
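
The routing decisions described above can be pictured as a small piece of control flow. The sketch below is only an illustration and not the authors' implementation: in QA-Dragon the routers act on the MLLM's pre-answer and reasoning trace, whereas here they are replaced by hand-written rules over three hypothetical boolean flags (`confident`, `needs_external_facts`, `object_identified`). The names `PreAnswer`, `search_router`, `tool_router`, and `plan`, and the fallback to text search, are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    """The three execution paths the Search Router can choose from."""
    DIRECT_OUTPUT = "direct_output"
    SEARCH_VERIFY = "search_verify"
    RAG = "rag"


class Tool(Enum):
    """The two retrieval agents the Tool Router can select."""
    IMAGE_SEARCH = "image_search"
    TEXT_SEARCH = "text_search"


@dataclass
class PreAnswer:
    """Hypothetical summary of the pre-answer stage (all flags are assumptions)."""
    answer: str
    confident: bool             # the model is sure of its preliminary answer
    needs_external_facts: bool  # the answer depends on facts not visible in the image
    object_identified: bool     # the main object in the image was recognized


def search_router(pre: PreAnswer) -> Path:
    """Rule-based stand-in for the Search Router."""
    if pre.confident and not pre.needs_external_facts:
        return Path.DIRECT_OUTPUT   # answer is clear from the image
    if pre.confident:
        return Path.SEARCH_VERIFY   # answer exists but needs factual verification
    return Path.RAG                 # answer must be synthesized from retrieved evidence


def tool_router(pre: PreAnswer) -> list[Tool]:
    """Rule-based stand-in for the Tool Router; may select one agent or both."""
    tools = []
    if not pre.object_identified:
        tools.append(Tool.IMAGE_SEARCH)  # identify the unknown object
    if pre.needs_external_facts:
        tools.append(Tool.TEXT_SEARCH)   # look up attributes not visible in the image
    # Assumed fallback for this sketch: default to text search if nothing was selected.
    return tools or [Tool.TEXT_SEARCH]


def plan(pre: PreAnswer) -> tuple[Path, list[Tool]]:
    """Combine both routers: no tools are needed when answering directly."""
    path = search_router(pre)
    if path is Path.DIRECT_OUTPUT:
        return path, []
    return path, tool_router(pre)
```

For instance, a confident pre-answer that needs no outside facts would be planned as a direct output with no tools, while an unconfident pre-answer about an unrecognized object that also needs outside facts would be planned as RAG with both agents. A real system would replace these rules with model-driven decisions.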

Performance and Impact

QA-Dragon was evaluated on the Meta CRAG-MM Challenge at KDD Cup 2025, a comprehensive benchmark for multimodal, multi-turn question answering. It improved both the answer accuracy and the knowledge overlap scores of the base models, outperforming baselines by 5.06% on single-source tasks, 6.35% on multi-source tasks, and 5.03% on multi-turn tasks.

Ablation studies further confirmed the importance of each component, showing that removing any part, such as the domain router or query splitting, led to a noticeable drop in performance. This highlights the framework’s robust and well-integrated design.

The research paper, available at https://arxiv.org/pdf/2508.05197, demonstrates that QA-Dragon represents a significant step forward in addressing the complexities of real-world VQA scenarios, offering a promising solution for more grounded and trustworthy AI systems.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]