TLDR: VaccineRAG is a new framework that improves Multimodal Large Language Models’ (MLLMs) ability to handle irrelevant or misleading information in Retrieval Augmented Generation (RAG) systems. It introduces a Chain-of-Thought (CoT) based dataset for detailed sample analysis and a novel training method called Partial-GRPO. This approach allows MLLMs to explicitly diagnose the helpfulness of retrieved samples, leading to significantly enhanced robustness and accuracy, even when faced with ‘polluted’ retrieval results.
Large Language Models (LLMs) have become incredibly powerful, but keeping them updated with the latest information is a constant challenge. Traditional methods like fine-tuning are costly and time-consuming. This is where Retrieval Augmented Generation (RAG) comes in, allowing LLMs to access external knowledge bases for real-time information. However, RAG systems often face a significant hurdle: the quality of the retrieved information. If the retriever brings in irrelevant or misleading data, the LLM’s output can be compromised, leading to inaccurate responses.
A new research paper introduces a solution called VaccineRAG, designed to make Multimodal Large Language Models (MLLMs) more resilient to these ‘harmful’ RAG samples. The core idea is to improve the model’s ability to discriminate between helpful and unhelpful information, much like a vaccine boosts immunity.
Understanding the Problem
Current RAG systems often prioritize speed, which can lead to retrieving information that looks similar on the surface but is semantically incorrect. Previous efforts have tried to improve the retriever itself or design better retrieval pipelines. While these methods can reduce harmful evidence, they assume the retriever’s accuracy remains stable. In reality, retrieval quality can degrade due to factors like changes in the knowledge base, leaving LLMs vulnerable.
Another approach, SURf, aimed to enhance robustness without improving the retriever by providing both relevant and irrelevant samples. However, it lacked clear diagnostic signals for error attribution and suffered from slow learning when dealing with many retrieved samples, as it relied on a single overall loss signal.
Introducing VaccineRAG
VaccineRAG tackles these issues by leveraging the deep reasoning capabilities of LLMs through a Chain-of-Thought (CoT) approach. Instead of just giving a final answer, the model is prompted to analyze each retrieved sample step-by-step, determining its helpfulness, summarizing relevant parts, and then generating the final answer.
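As a rough illustration of that step-by-step prompting, the sketch below assembles a prompt that asks for a per-sample judgment, then a summary, then an answer. The function name `build_cot_prompt`, the `<analysis>`, `<conclusion>` and `<answer>` tags, and the text-only treatment of samples are assumptions made for this example; the paper's actual prompt template is not reproduced here.

```python
def build_cot_prompt(question: str, samples: list[str]) -> str:
    """Build a CoT-style RAG prompt (hypothetical template, not the paper's)."""
    # Number the retrieved samples so the model can refer to them as [1], [2], ...
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(samples, start=1)
    )
    return (
        f"Question: {question}\n\n"
        f"Retrieved samples:\n{numbered}\n\n"
        "Inside <analysis>, judge each sample as helpful or unhelpful "
        "and give a short reason.\n"
        "Inside <conclusion>, summarize only the helpful samples, "
        "citing them by number.\n"
        "Inside <answer>, give the final answer.\n"
    )


prompt = build_cot_prompt(
    "What color is the bridge?",
    ["Caption: a red suspension bridge", "Caption: a blue car"],
)
print(prompt)
```

In a real multimodal setup the images themselves would be passed to the model alongside the captions; the numbering scheme is what lets later steps refer back to individual samples.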
To facilitate this, the researchers created a novel CoT-based Multimodal RAG dataset, also named VaccineRAG. This dataset is built upon the WebQA dataset and includes detailed CoT annotations. Each entry pairs a question with an image and caption, along with a helpfulness label. Crucially, it provides explanations for why each sample is helpful or unhelpful and how the final answer is derived. This involves two levels of reasoning: analyzing individual samples for relevance and then integrating helpful samples to form a conclusion.
The annotation process involved using a state-of-the-art commercial large model (GPT-4o) for initial annotations, followed by rigorous manual verification to ensure high quality. The dataset comprises approximately 10,000 entries, each with an average of five retrieved samples, including both image and text references.
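To make the shape of such an entry concrete, here is a minimal sketch of how one annotated record could be represented. All class and field names (`RetrievedSample`, `CoTEntry`, `rationale`, and so on) are hypothetical; the released dataset may use a different schema.

```python
from dataclasses import dataclass


@dataclass
class RetrievedSample:
    kind: str        # "image" or "text" (assumed labels)
    content: str     # image path or text snippet
    caption: str     # caption accompanying an image, empty for text
    helpful: bool    # helpfulness label
    rationale: str   # explanation of why the sample is (un)helpful


@dataclass
class CoTEntry:
    question: str
    samples: list[RetrievedSample]
    conclusion: str  # how the helpful samples combine into the answer
    answer: str

    def helpful_indices(self) -> list[int]:
        """1-based indices of the samples labeled helpful."""
        return [i for i, s in enumerate(self.samples, start=1) if s.helpful]


entry = CoTEntry(
    question="What color is the bridge?",
    samples=[
        RetrievedSample("image", "bridge.jpg", "A red suspension bridge",
                        True, "Shows the bridge in question."),
        RetrievedSample("text", "The car is blue.", "",
                        False, "Describes a car, not the bridge."),
    ],
    conclusion="Sample [1] shows the bridge, which is red.",
    answer="red",
)
print(entry.helpful_indices())
```

The two levels of reasoning map onto the per-sample `rationale` fields and the entry-level `conclusion`.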
The Partial-GRPO Methodology
To train models effectively on this complex CoT data, the researchers proposed Partial-GRPO, a variant of GRPO (Group Relative Policy Optimization). This method helps the model learn long and intricate CoT content by treating the LLM’s output not as a single whole, but as multiple distinct components that can each be scored on their own. This allows for more informed preference selection over complex sequences.
Partial-GRPO uses three specific reward functions:
- Format Reward: Encourages the model to adhere to the required output template, ensuring structured responses.
- Helpfulness Reward: Evaluates if the model correctly identifies the helpfulness of each retrieved sample.
- Conclusion Reward: Checks if the model accurately cites the previously identified ‘helpful’ samples in its summary and avoids citing ‘unhelpful’ ones.
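The three rewards can be sketched as simple rule-based checks over the model's output. The tag-based template, the `[n] helpful` / `[n] unhelpful` verdict syntax, the function names, and the 0-to-1 scoring are all assumptions chosen for illustration; the paper's actual reward definitions may differ.

```python
import re

# Hypothetical output template: <analysis>, <conclusion>, <answer> in order.
TEMPLATE = re.compile(
    r"\s*<analysis>(?P<analysis>.*?)</analysis>\s*"
    r"<conclusion>(?P<conclusion>.*?)</conclusion>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)
VERDICT = re.compile(r"\[(\d+)\]\s*(helpful|unhelpful)\b")
CITATION = re.compile(r"\[(\d+)\]")


def format_reward(output: str) -> float:
    """1.0 if the output follows the template exactly, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(output) else 0.0


def parse_verdicts(output: str) -> dict[int, bool]:
    """Map sample index -> predicted helpfulness, read from <analysis>."""
    m = TEMPLATE.fullmatch(output)
    if not m:
        return {}
    return {int(i): v == "helpful" for i, v in VERDICT.findall(m["analysis"])}


def helpfulness_reward(output: str, gold: dict[int, bool]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    if not gold:
        return 0.0
    pred = parse_verdicts(output)
    return sum(pred.get(i) == label for i, label in gold.items()) / len(gold)


def conclusion_reward(output: str) -> float:
    """1.0 if the conclusion cites exactly the samples judged helpful."""
    m = TEMPLATE.fullmatch(output)
    if not m:
        return 0.0
    judged_helpful = {i for i, h in parse_verdicts(output).items() if h}
    cited = {int(i) for i in CITATION.findall(m["conclusion"])}
    return 1.0 if cited == judged_helpful else 0.0


out = (
    "<analysis>\n[1] helpful: shows the bridge.\n"
    "[2] unhelpful: describes a car.\n</analysis>\n"
    "<conclusion>Based on [1], the bridge is red.</conclusion>\n"
    "<answer>red</answer>"
)
print(format_reward(out), helpfulness_reward(out, {1: True, 2: False}),
      conclusion_reward(out))
```

Note that the conclusion check in this sketch compares citations against the model's own earlier verdicts rather than the gold labels, which matches the description of citing "previously identified" helpful samples.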
Unlike traditional GRPO, which applies a uniform reward across the entire output, Partial-GRPO enables targeted gradient backpropagation for different segments of the CoT (helpfulness analysis, conclusion, final answer). This significantly accelerates convergence and improves performance, as it provides more fine-grained reward signals.
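The contrast can be illustrated with a small sketch of how advantages might be assigned to tokens. It assumes group-normalized advantages as in GRPO and assumes each segment is paired with its own reward; the segment names, function names, and the exact pairing are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalize(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each sampled output relative to its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_advantages(rewards, lengths):
    """GRPO-style: one scalar per output, broadcast to all of its tokens."""
    adv = group_normalize(rewards)
    return [[a] * n for a, n in zip(adv, lengths)]


def partial_advantages(segment_rewards, segment_lengths):
    """Per-segment sketch: each segment's tokens get that segment's advantage.

    segment_rewards: {segment: [reward for each output in the group]}
    segment_lengths: [{segment: token count} for each output in the group]
    """
    per_segment = {s: group_normalize(r) for s, r in segment_rewards.items()}
    result = []
    for i, lengths in enumerate(segment_lengths):
        tokens = []
        for s in segment_rewards:  # segments in output order
            tokens.extend([per_segment[s][i]] * lengths[s])
        result.append(tokens)
    return result


# Two sampled outputs: the first gets the analysis right, the second the conclusion.
seg_rewards = {"analysis": [1.0, 0.0], "conclusion": [0.0, 1.0]}
seg_lengths = [{"analysis": 2, "conclusion": 1}, {"analysis": 1, "conclusion": 2}]

totals = [sum(r[i] for r in seg_rewards.values()) for i in range(2)]
print(uniform_advantages(totals, [3, 3]))
print(partial_advantages(seg_rewards, seg_lengths))
```

In this toy group both outputs have the same total reward, so the uniform scheme assigns zero advantage everywhere, while the per-segment scheme still rewards the correct analysis in the first output and the correct conclusion in the second.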
Experimental Validation
The effectiveness of VaccineRAG and Partial-GRPO was validated through comprehensive experiments using mainstream multimodal large models like Qwen2-VL, Qwen2.5-VL, and InternVL3. The evaluation focused on two main scenarios:
- Polluted Generation: Harmful retrieval samples were incrementally added to helpful ones. VaccineRAG with Partial-GRPO significantly improved the model’s Accuracy Degradation Rate, meaning it maintained higher accuracy even with more misleading information. It also showed better Mean Accuracy compared to baselines like Zero-shot, SURf, and traditional GRPO.
- TopK Generation: A fixed number of top-K samples were retrieved. The trained models consistently achieved better performance with larger K values (more retrieval samples), demonstrating their ability to effectively utilize helpful information while being less affected by harmful ones. Untrained models, in contrast, often saw accuracy decline as K increased.
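A polluted-generation evaluation of this kind can be sketched as follows. The article does not give the formula for Accuracy Degradation Rate, so the definition used here (relative drop from the clean setting to the most polluted one) is an assumption, as are the function names and the example numbers, which are made up purely to exercise the code.

```python
def build_polluted_context(helpful: list[str], harmful: list[str],
                           n_harmful: int) -> list[str]:
    """Helpful samples plus the first n_harmful harmful ones."""
    return helpful + harmful[:n_harmful]


def mean_accuracy(acc_by_level: list[float]) -> float:
    """Average accuracy across all pollution levels."""
    return sum(acc_by_level) / len(acc_by_level)


def degradation_rate(acc_by_level: list[float]) -> float:
    """Assumed definition: relative accuracy drop from level 0 to the last level."""
    clean, worst = acc_by_level[0], acc_by_level[-1]
    return (clean - worst) / clean if clean else 0.0


# Illustrative accuracies at 0, 1 and 2 added harmful samples (not real results).
acc = [0.8, 0.7, 0.6]
print(mean_accuracy(acc), degradation_rate(acc))
```

Under this assumed definition a lower degradation rate means the model loses less accuracy as more misleading samples are added.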
Ablation studies confirmed the indispensability of all three reward functions in Partial-GRPO. Removing any one of them led to significant deterioration in performance or even model collapse, highlighting their crucial role in guiding the model’s learning process.
Looking Ahead
VaccineRAG represents a significant step towards more robust and reliable RAG systems for MLLMs. By focusing on fine-grained reasoning and sample analysis through its novel dataset and the Partial-GRPO training approach, it addresses a critical bottleneck in current RAG implementations. This work paves the way for LLM applications that are less susceptible to misleading information, ultimately leading to more accurate and trustworthy responses. The code and dataset will be publicly released soon, allowing other researchers to build upon this important advancement. You can read the full research paper here.


