TLDR: CHAIR-DPO is a new method that uses a metric called CHAIR to identify and reduce “hallucinations” (when AI models describe things not present in an image) in Multimodal Large Language Models (MLLMs). It works by creating preference data based on how many hallucinated objects are in an AI’s response and then fine-tuning the MLLM using Direct Preference Optimization (DPO). This approach is simpler than previous methods, doesn’t rely on proprietary models, and significantly improves the factual accuracy of MLLMs without harming their other abilities.
Multimodal Large Language Models (MLLMs) are powerful AI systems that can understand and process information from various sources, including text and images. They are becoming a unified tool for many tasks, from natural language processing to computer vision. However, despite their impressive capabilities, MLLMs often suffer from a significant problem: hallucinations. [RESEARCH_PAPER_URL]
Understanding AI Hallucinations
In the context of MLLMs, a hallucination occurs when the model generates an answer to a user’s query that is not actually reflected in the visual input. For example, an MLLM might describe an object in an image that isn’t there. This issue has been a long-standing challenge in the field, and it becomes even more complex when dealing with multiple data types like text and images.
Introducing CHAIR-DPO: A New Approach to Factual AI
A recent research paper titled “Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization” by Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara from the University of Modena and Reggio Emilia, proposes a novel solution called CHAIR-DPO. This method tackles the problem of hallucinations by treating it as an alignment issue, aiming to guide the MLLM to prefer generating content that is free from hallucinations.
Unlike many existing approaches that require complex processes or rely on proprietary AI models to create synthetic training data, CHAIR-DPO offers a simpler and more efficient framework. It capitalizes on a well-known metric called CHAIR (Caption Hallucination Assessment with Image Relevance), which was originally designed to measure the extent of hallucinations in image captioning.
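Concretely, the instance-level variant of the metric (often written CHAIRi) is the share of mentioned objects that cannot be matched to the image. The notation below is ours, but it follows the standard formulation:

```latex
\mathrm{CHAIR}_i =
\frac{\lvert \{\text{objects mentioned in the answer}\} \setminus \{\text{objects present in the image}\} \rvert}
     {\lvert \{\text{objects mentioned in the answer}\} \rvert}
```

A score of 0 means every object the model mentioned was actually found in the image; higher scores indicate more hallucination.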
How CHAIR-DPO Works
The core idea behind CHAIR-DPO is to use the CHAIR metric to evaluate and rank different answers generated by an MLLM. Here’s a simplified breakdown of the process:
- Collecting Preference Data: For a given image and text prompt, the system asks an MLLM to generate two possible answers.
- Object Detection: An off-the-shelf object detector is used to identify all the objects truly present in the input image. This creates a ‘ground-truth’ list of objects.
- CHAIRi Score Calculation: For each generated answer, a CHAIRi score is calculated. This score represents the fraction of hallucinated objects mentioned in the answer compared to all objects mentioned. An object is considered hallucinated if it’s mentioned in the answer but not found in the ground-truth list from the object detector.
- Designating Winners and Losers: The answer with the lower CHAIRi score (fewer hallucinations) is designated as the ‘winner’ (preferred option), and the answer with the higher score is the ‘loser’ (dispreferred option).
- Direct Preference Optimization (DPO): These pairs of preferred and dispreferred answers are then used to fine-tune the MLLM using Direct Preference Optimization (DPO). DPO is an effective training method that aligns the model with a given set of preferences (here, metric-derived rather than human-annotated) by increasing the probability of generating preferred answers and decreasing the probability of generating dispreferred ones. This process makes the MLLM more aware of the actual objects in an image.
- Data Filtering: To ensure reliable training, the researchers also implemented a filtering strategy, discarding any pairs of answers where the CHAIRi scores were identical. This prevents the model from learning from ambiguous or noisy data.
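The scoring, pairing, and filtering steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the sets of mentioned and detected object names are assumed to have been extracted already (by a text parser and an object detector, respectively), and the DPO loss shown is the standard per-pair objective.

```python
import math

def chair_i(mentioned: set[str], detected: set[str]) -> float:
    """Fraction of mentioned objects that are hallucinated,
    i.e. mentioned in the answer but absent from the detector's list."""
    if not mentioned:
        return 0.0
    hallucinated = mentioned - detected
    return len(hallucinated) / len(mentioned)

def build_preference_pair(answer_a: str, objects_a: set[str],
                          answer_b: str, objects_b: set[str],
                          detected: set[str]):
    """Return (winner, loser) ranked by CHAIRi, or None on a tie.
    Ties are discarded, mirroring the paper's filtering step."""
    score_a = chair_i(objects_a, detected)
    score_b = chair_i(objects_b, detected)
    if score_a == score_b:
        return None  # ambiguous pair: no training signal
    return (answer_a, answer_b) if score_a < score_b else (answer_b, answer_a)

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO objective for one (winner, loser) pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy example: the detector found a dog, a frisbee, and grass.
detected = {"dog", "frisbee", "grass"}
pair = build_preference_pair(
    "A dog catches a frisbee.", {"dog", "frisbee"},
    "A dog and a cat play with a ball.", {"dog", "cat", "ball"},
    detected)
# The first answer wins: CHAIRi = 0/2 versus 2/3 for the second.
```

In a real pipeline the log-probabilities fed to `dpo_loss` would come from the fine-tuned model and a frozen reference copy; the winner's margin over the loser is what the optimizer pushes up.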
Impressive Results and Performance Preservation
The experiments conducted by the researchers demonstrated that CHAIR-DPO achieves state-of-the-art performance in mitigating hallucinations across several benchmarks, including AMBER, CHAIR-MSCOCO, and Object HalBench. It significantly reduces the rate of hallucinated answers.
Crucially, CHAIR-DPO manages to achieve these improvements without negatively impacting the MLLM’s general cognitive abilities. Tests on various benchmarks that evaluate overall MLLM performance (like MME, SEED-Bench, MMMU, Science-QA, and AI2D) showed that the models either maintained or even slightly improved their capabilities. This indicates that CHAIR-DPO effectively makes MLLMs more factually accurate without causing them to ‘forget’ other important knowledge.
Qualitative examples further illustrate the method’s effectiveness. When describing images, MLLMs fine-tuned with CHAIR-DPO not only avoided mentioning non-existent objects but also often provided more fine-grained details about the objects that were actually present.
A Step Towards More Trustworthy AI
In conclusion, CHAIR-DPO represents a significant advancement in making Multimodal Large Language Models more reliable and factually grounded. By leveraging a straightforward metric and an efficient optimization technique, this method provides a practical and effective way to reduce visual hallucinations, paving the way for more trustworthy and accurate AI systems that truly understand what they ‘see’. For more details, you can refer to the original research paper.