
Enhancing MLLM Accuracy: A New Method for Controlled Image Captioning

TLDR: A new method called Multimodal Reward-Guided Decoding (MRGD) allows real-time control over Multimodal Large Language Models (MLLMs) to reduce object hallucinations and balance object precision and recall in image captions. It uses two specialized reward models—one for hallucination reduction and one for object recall—to guide the model’s output generation, outperforming existing methods and offering flexible control over output quality and computational cost.

Multimodal Large Language Models (MLLMs) are becoming increasingly popular for various tasks that combine vision and language, such as image captioning. However, a key challenge with these powerful models is controlling their behavior to meet specific user needs, especially when it comes to accuracy and detail in their outputs. For instance, a user with visual impairment might need highly precise descriptions to avoid misleading information, while another user generating synthetic data might prioritize diverse and detailed outputs, even if it means tolerating some inaccuracies.

A new research paper titled “Controlling Multimodal LLMs via Reward-guided Decoding” introduces a novel method called Multimodal Reward-Guided Decoding (MRGD) to address this challenge. This approach allows for fine-grained control over MLLM outputs during the inference process, which is when the model generates its responses.

The Problem of Hallucinations and Control

One significant issue with MLLMs is “hallucinations,” where the model generates information that is not present in the image. Previous methods to reduce hallucinations, like prompting or fine-tuning, offer limited control during inference. For example, prompting relies on general instructions, and fine-tuning methods like Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF) don’t allow for real-time adjustments to the model’s behavior once it’s trained.

The authors highlight two crucial aspects of control: managing the trade-off between object precision (avoiding hallucinations) and object recall (including all relevant objects), and balancing the quality of visual grounding with the computational resources used during generation.

Introducing Multimodal Reward-Guided Decoding (MRGD)

MRGD tackles these issues by building and utilizing specialized “reward models” that guide the MLLM’s decoding process. Unlike text-only models, MLLMs require reward models that can understand both visual and textual information simultaneously, which presents unique challenges.

The core of MRGD involves two distinct reward models:

  • Object Hallucination Reward Model (rhal): This model is trained on preference data, where it learns to distinguish between responses with and without object hallucinations. It uses PaliGemma, a powerful vision-language model, as its backbone, and is fine-tuned to predict a score between 0 and 1 indicating how free from hallucinations a given caption is.

  • Object Recall Reward Model (rrec): This model is built by combining existing tools. It uses an object detector (OWLv2) to identify objects in the image and a word embedding model (Sentence-BERT) along with NLP tools (NLTK) to identify and compare objects mentioned in the generated caption. It then calculates a score based on how many of the detected objects from the image are correctly mentioned in the caption.
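The recall reward described above boils down to a coverage score: what fraction of the objects detected in the image actually appear in the caption. Here is a minimal sketch of that idea, with the caveat that the paper uses OWLv2 detections and Sentence-BERT embeddings for matching; this simplified stand-in uses plain substring matching, so treat it as illustrative only:

```python
def recall_reward(detected_objects, caption):
    """Sketch of an object-recall reward: the fraction of objects
    detected in the image that are mentioned in the caption.

    Simplification: real matching would use embedding similarity
    (e.g. Sentence-BERT) rather than naive substring checks, which
    can false-match (e.g. "cat" inside "category").
    """
    if not detected_objects:
        return 1.0  # nothing to recall, so full score
    text = caption.lower()
    mentioned = sum(1 for obj in detected_objects if obj.lower() in text)
    return mentioned / len(detected_objects)
```

For example, if the detector finds a dog and a cat but the caption only mentions the dog, the reward is 0.5.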

How MRGD Works

MRGD guides the MLLM’s generation by combining the scores from these two reward models. A user can dynamically adjust a “guidance strength” hyperparameter, ‘w’, between 0 and 1. If ‘w’ is 1, the model prioritizes hallucination reduction. If ‘w’ is 0, it prioritizes object recall. Values in between allow for a smooth trade-off.
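The behavior described above is consistent with a simple linear interpolation between the two reward scores. The article does not spell out the exact combination formula, so the following is an assumed sketch in which `w = 1` recovers pure hallucination reduction and `w = 0` recovers pure recall:

```python
def combined_reward(r_hal, r_rec, w):
    """Assumed linear blend of the two reward scores.

    w = 1.0 -> only the hallucination reward matters.
    w = 0.0 -> only the recall reward matters.
    Intermediate values trade one off against the other.
    """
    assert 0.0 <= w <= 1.0, "guidance strength must lie in [0, 1]"
    return w * r_hal + (1.0 - w) * r_rec
```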

During decoding, the MLLM generates multiple candidate completions for a partial response. Each candidate is then evaluated using the combined reward score. The candidate with the highest score is selected and added to the response, and this process continues until the full caption is generated. This search-based approach allows for more precise control than simple fine-tuning.
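The decoding loop just described can be sketched as a greedy segment-level search: sample several candidate continuations of the partial response, score each with the combined reward, and keep the best. The sketch below assumes two caller-supplied hypothetical callables, `generate_candidates` (which samples `k` continuations from the MLLM) and `score` (the combined reward over a partial caption); the real method's sampling and stopping details may differ:

```python
def reward_guided_decode(generate_candidates, score, k=4,
                         max_segments=20, stop_token="<eos>"):
    """Greedy segment-level search over candidate continuations.

    generate_candidates(prefix, k) -> list of k candidate text segments.
    score(partial_caption) -> reward for a partial caption.
    Both callables are hypothetical stand-ins for the MLLM sampler
    and the combined reward model.
    """
    response = ""
    for _ in range(max_segments):
        candidates = generate_candidates(response, k)
        # Keep the continuation whose extended caption scores highest.
        best = max(candidates, key=lambda seg: score(response + seg))
        response += best
        if stop_token in best:
            break
    return response
```

Raising `k` widens the search at each step, which is the grounding-versus-compute knob the article discusses later.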


Experimental Results and Impact

The researchers evaluated MRGD on standard object hallucination benchmarks like CHAIR and AMBER, using MLLMs such as LLaVA-1.5, Llama-3.2-Vision, and SmolVLM-2. The results show that MRGD significantly reduces object hallucinations compared to greedy decoding and consistently outperforms existing hallucination mitigation methods, including other fine-tuning and guided decoding approaches.

A key finding is the inherent trade-off between object precision and recall in MLLMs. MRGD allows users to navigate this trade-off effectively. For instance, setting ‘w’ to 1.0 drastically reduces hallucinations, while a lower ‘w’ (e.g., 0.0) boosts object recall, albeit with a higher hallucination rate. The method also demonstrates a trade-off between visual grounding quality and computational cost, where increasing the number of candidate samples (‘k’) improves grounding but requires more compute.

Importantly, the reward models developed for MRGD can be applied to new MLLMs without requiring retraining, making the approach highly adaptable. This flexibility and granular control over MLLM outputs represent a significant step forward in making these powerful models more reliable and user-friendly. For more technical details, you can refer to the full research paper: Controlling Multimodal LLMs via Reward-guided Decoding.

While the current work focuses on object hallucinations, the authors suggest future work could extend MRGD to mitigate other types of visual hallucinations, explore reward models for semantically incomplete outputs, and apply the method to discriminative tasks.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
