
Enhancing MLLM Accuracy: A New Method for Controlled Image Captioning

TLDR: A new method called Multimodal Reward-Guided Decoding (MRGD) allows real-time control over Multimodal Large Language Models (MLLMs) to reduce object hallucinations and balance object precision and recall in image captions. It uses two specialized reward models—one for hallucination reduction and one for object recall—to guide the model’s output generation, outperforming existing methods and offering flexible control over output quality and computational cost.

Multimodal Large Language Models (MLLMs) are becoming increasingly popular for various tasks that combine vision and language, such as image captioning. However, a key challenge with these powerful models is controlling their behavior to meet specific user needs, especially when it comes to accuracy and detail in their outputs. For instance, a user with visual impairment might need highly precise descriptions to avoid misleading information, while another user generating synthetic data might prioritize diverse and detailed outputs, even if it means tolerating some inaccuracies.

A new research paper titled “Controlling Multimodal LLMs via Reward-guided Decoding” introduces a novel method called Multimodal Reward-Guided Decoding (MRGD) to address this challenge. This approach allows for fine-grained control over MLLM outputs during the inference process, which is when the model generates its responses.

The Problem of Hallucinations and Control

One significant issue with MLLMs is “hallucinations,” where the model generates information that is not present in the image. Previous methods to reduce hallucinations, like prompting or fine-tuning, offer limited control during inference. For example, prompting relies on general instructions, and fine-tuning methods like Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF) don’t allow for real-time adjustments to the model’s behavior once it’s trained.

The authors highlight two crucial aspects of control: managing the trade-off between object precision (avoiding hallucinations) and object recall (including all relevant objects), and balancing the quality of visual grounding with the computational resources used during generation.

Introducing Multimodal Reward-Guided Decoding (MRGD)

MRGD tackles these issues by building and utilizing specialized “reward models” that guide the MLLM’s decoding process. Unlike text-only models, MLLMs require reward models that can understand both visual and textual information simultaneously, which presents unique challenges.

The core of MRGD involves two distinct reward models:

  • Object Hallucination Reward Model (rhal): This model is trained on preference data, where it learns to distinguish between responses with and without object hallucinations. It uses PaliGemma, a powerful vision-language model, as its backbone, and is fine-tuned to predict a score between 0 and 1 indicating how free from hallucinations a given caption is.

  • Object Recall Reward Model (rrec): This model is built by combining existing tools. It uses an object detector (OWLv2) to identify objects in the image and a word embedding model (Sentence-BERT) along with NLP tools (NLTK) to identify and compare objects mentioned in the generated caption. It then calculates a score based on how many of the detected objects from the image are correctly mentioned in the caption.
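The recall reward described above boils down to a coverage score: what fraction of the objects detected in the image actually appear in the caption. Here is a minimal sketch of that idea, with the caveat that the paper uses OWLv2 detections and Sentence-BERT embeddings for matching; this simplified stand-in uses plain substring matching, so treat it as illustrative only:

```python
def recall_reward(detected_objects, caption):
    """Sketch of an object-recall reward: the fraction of objects
    detected in the image that are mentioned in the caption.

    Simplification: real matching would use embedding similarity
    (e.g. Sentence-BERT) rather than naive substring checks, which
    can false-match (e.g. "cat" inside "category").
    """
    if not detected_objects:
        return 1.0  # nothing to recall, so full score
    text = caption.lower()
    mentioned = sum(1 for obj in detected_objects if obj.lower() in text)
    return mentioned / len(detected_objects)
```

For example, if the detector finds a dog and a cat but the caption only mentions the dog, the reward is 0.5.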

How MRGD Works

MRGD guides the MLLM’s generation by combining the scores from these two reward models. A user can dynamically adjust a “guidance strength” hyperparameter, ‘w’, between 0 and 1. If ‘w’ is 1, the model prioritizes hallucination reduction. If ‘w’ is 0, it prioritizes object recall. Values in between allow for a smooth trade-off.
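The behavior described above is consistent with a simple linear interpolation between the two reward scores. The article does not spell out the exact combination formula, so the following is an assumed sketch in which `w = 1` recovers pure hallucination reduction and `w = 0` recovers pure recall:

```python
def combined_reward(r_hal, r_rec, w):
    """Assumed linear blend of the two reward scores.

    w = 1.0 -> only the hallucination reward matters.
    w = 0.0 -> only the recall reward matters.
    Intermediate values trade one off against the other.
    """
    assert 0.0 <= w <= 1.0, "guidance strength must lie in [0, 1]"
    return w * r_hal + (1.0 - w) * r_rec
```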

During decoding, the MLLM generates multiple candidate completions for a partial response. Each candidate is then evaluated using the combined reward score. The candidate with the highest score is selected and added to the response, and this process continues until the full caption is generated. This search-based approach allows for more precise control than simple fine-tuning.
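The decoding loop just described can be sketched as a greedy segment-level search: sample several candidate continuations of the partial response, score each with the combined reward, and keep the best. The sketch below assumes two caller-supplied hypothetical callables, `generate_candidates` (which samples `k` continuations from the MLLM) and `score` (the combined reward over a partial caption); the real method's sampling and stopping details may differ:

```python
def reward_guided_decode(generate_candidates, score, k=4,
                         max_segments=20, stop_token="<eos>"):
    """Greedy segment-level search over candidate continuations.

    generate_candidates(prefix, k) -> list of k candidate text segments.
    score(partial_caption) -> reward for a partial caption.
    Both callables are hypothetical stand-ins for the MLLM sampler
    and the combined reward model.
    """
    response = ""
    for _ in range(max_segments):
        candidates = generate_candidates(response, k)
        # Keep the continuation whose extended caption scores highest.
        best = max(candidates, key=lambda seg: score(response + seg))
        response += best
        if stop_token in best:
            break
    return response
```

Raising `k` widens the search at each step, which is the grounding-versus-compute knob the article discusses later.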


Experimental Results and Impact

The researchers evaluated MRGD on standard object hallucination benchmarks like CHAIR and AMBER, using MLLMs such as LLaVA-1.5, Llama-3.2-Vision, and SmolVLM-2. The results show that MRGD significantly reduces object hallucinations compared to greedy decoding and consistently outperforms existing hallucination mitigation methods, including other fine-tuning and guided decoding approaches.

A key finding is the inherent trade-off between object precision and recall in MLLMs. MRGD allows users to navigate this trade-off effectively. For instance, setting ‘w’ to 1.0 drastically reduces hallucinations, while a lower ‘w’ (e.g., 0.0) boosts object recall, albeit with a higher hallucination rate. The method also demonstrates a trade-off between visual grounding quality and computational cost, where increasing the number of candidate samples (‘k’) improves grounding but requires more compute.

Importantly, the reward models developed for MRGD can be applied to new MLLMs without requiring retraining, making the approach highly adaptable. This flexibility and granular control over MLLM outputs represent a significant step forward in making these powerful models more reliable and user-friendly. For more technical details, you can refer to the full research paper: Controlling Multimodal LLMs via Reward-guided Decoding.

While the current work focuses on object hallucinations, the authors suggest future work could extend MRGD to mitigate other types of visual hallucinations, explore reward models for semantically incomplete outputs, and apply the method to discriminative tasks.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
