TLDR: MDSEval is the first meta-evaluation benchmark for Multimodal Dialogue Summarization (MDS), featuring image-sharing dialogues, human-annotated summaries across eight quality aspects, and a novel MEKI filtering framework. Benchmarking reveals that current MLLM-based evaluation methods struggle to align with human judgments due to biases like score concentration and positional preferences, highlighting the need for more robust assessment techniques.
Human communication is naturally multimodal, involving text, images, videos, and audio. This has led to the rise of Multimodal Large Language Models (MLLMs), which combine information from different sources to create more natural and effective interactions. A key application of these models is Multimodal Dialogue Summarization (MDS), a task that aims to condense important information from conversations that include various forms of media, such as image-sharing chats.
Developing effective MDS models requires reliable automatic evaluation methods to speed up development and reduce the need for manual assessment. However, these automatic evaluators must themselves be validated against a benchmark grounded in human judgments, a so-called meta-evaluation benchmark. Until now, no such benchmark existed for MDS.
To fill this gap, researchers have introduced MDSEval, the first meta-evaluation benchmark specifically designed for Multimodal Dialogue Summarization. MDSEval provides a comprehensive dataset that includes image-sharing dialogues, several candidate summaries for each dialogue, and detailed human evaluations across eight distinct quality aspects. This benchmark allows for systematic comparisons of different evaluation methods, highlights their weaknesses, and offers valuable insights for creating more accurate and human-aligned assessment techniques for multimodal summarization.
How MDSEval Was Created
The creation of MDSEval involved a careful multi-stage process. The benchmark contains 198 high-quality image-sharing dialogues selected from existing datasets such as PhotoChat and DialogCC. To ensure the dialogues were suitable and challenging enough for summarization, in the sense that a good summary must draw on information from both text and images, the authors introduced a new data filtering framework built around a concept called Mutually Exclusive Key Information (MEKI).
MEKI is designed to identify information that is uniquely conveyed by one modality (either text or image) and cannot be easily guessed from the other. This emphasizes the need for true multimodal understanding in summarization. The research found that MEKI scores strongly correlate with human judgments, indicating its effectiveness in identifying complex multimodal dialogues.
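The article does not spell out how MEKI is computed, but one plausible reading is a projection-residual score over embeddings of key-information units extracted from each modality: information in modality A that cannot be reconstructed from modality B counts as exclusive to A. The sketch below is only an illustration under that assumption; the extraction step, the scoring heuristic, the function names, and the 0.3 threshold are made up for this example and are not the paper's exact formulation.

```python
import numpy as np

def exclusive_information(key_vecs_a: np.ndarray, key_vecs_b: np.ndarray) -> float:
    """Rough MEKI-style score: how much of modality A's key information
    cannot be linearly reconstructed from modality B's key information.

    key_vecs_a: (n_a, d) embeddings of key-information units from modality A
    key_vecs_b: (n_b, d) embeddings of key-information units from modality B
    """
    # Orthonormal basis for the span of modality B's key-information embeddings.
    q, _ = np.linalg.qr(key_vecs_b.T)              # columns of q are orthonormal
    residual_ratios = []
    for v in key_vecs_a:
        proj = q @ (q.T @ v)                       # component of v explained by B
        residual = v - proj                        # component unique to A
        residual_ratios.append(np.linalg.norm(residual) / (np.linalg.norm(v) + 1e-8))
    return float(np.mean(residual_ratios))         # higher -> more A-exclusive info

def passes_meki_filter(text_vecs, image_vecs, threshold=0.3):
    """Keep a dialogue only if each modality carries enough information
    that the other modality cannot supply (illustrative threshold)."""
    return (exclusive_information(text_vecs, image_vecs) >= threshold and
            exclusive_information(image_vecs, text_vecs) >= threshold)
```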
For each image-sharing dialogue, five summaries were generated using various state-of-the-art MLLMs and different prompting strategies. This was done to create a diverse range of summary qualities for evaluation.
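As a rough illustration of this generation step, the snippet below varies both the model and the prompt to obtain candidate summaries of differing quality. `call_mllm`, the prompt wordings, and the loop structure are hypothetical placeholders, not the actual setup used by the MDSEval authors.

```python
# Hypothetical sketch: produce diverse candidate summaries per dialogue
# by combining different MLLMs with different prompting strategies.

PROMPT_VARIANTS = [
    "Summarize this image-sharing conversation in 3-4 sentences.",
    "Write a concise summary that explicitly describes what each shared image shows.",
    "Summarize the dialogue, preserving topic order and linking images to the turns they appear in.",
]

def generate_candidates(dialogue_turns, images, models, call_mllm):
    """call_mllm stands in for any multimodal chat API taking text, images, and an instruction."""
    candidates = []
    for model in models:
        for prompt in PROMPT_VARIANTS:
            summary = call_mllm(
                model=model,
                text="\n".join(dialogue_turns),
                images=images,
                instruction=prompt,
            )
            candidates.append({"model": model, "prompt": prompt, "summary": summary})
    return candidates
```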
Understanding Summary Quality: The Eight Evaluation Aspects
To thoroughly assess the quality of summaries, MDSEval defines eight specific evaluation aspects tailored for the MDS task. These aspects focus on capturing cross-modal understanding and overall summary quality:
- Multimodal Coherence: How naturally the summary integrates information from both images and text.
- Conciseness: How efficiently the summary conveys essential information without being overly wordy.
- Multimodal Coverage (Visual, Textual, and Overall): The extent to which the summary captures key information from visual elements, textual dialogue, and both combined.
- Multimodal Information Balancing: How well the summary balances information from different modalities, avoiding overemphasis on one.
- Topic Progression: How accurately the summary captures the flow of topics and associates images with relevant parts of the dialogue.
- Multimodal Faithfulness: Evaluated at the sentence level, this assesses whether the summary accurately reflects the original dialogue and images without introducing incorrect or fabricated information.
These aspects were meticulously annotated by experienced human experts, with strong agreement among annotators, ensuring the reliability of the benchmark.
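To make the aspects concrete, here is one way they could be encoded as a scoring rubric and turned into a single-aspect judging prompt. The aspect descriptions are paraphrased from the list above, and the 1-to-5 scale is an assumption (the score-concentration finding discussed below suggests a 1-5 Likert scale); the exact annotation guidelines are in the paper.

```python
# Illustrative encoding of the eight MDSEval aspects as a scoring rubric.

ASPECTS = {
    "multimodal_coherence":    "How naturally the summary integrates image and text information.",
    "conciseness":             "How efficiently the summary conveys essential information.",
    "coverage_visual":         "Coverage of key information from the shared images.",
    "coverage_textual":        "Coverage of key information from the textual dialogue.",
    "coverage_overall":        "Coverage of key information from both modalities combined.",
    "information_balancing":   "Balance between modalities, without overemphasizing either one.",
    "topic_progression":       "Whether topic flow and image-to-turn associations are preserved.",
    "multimodal_faithfulness": "Absence of content contradicting the dialogue or the images.",
}

def build_judge_prompt(dialogue: str, summary: str, aspect: str) -> str:
    """Build a single-aspect scoring prompt for an MLLM-as-a-judge setup (assumed 1-5 scale)."""
    return (
        f"Dialogue:\n{dialogue}\n\nSummary:\n{summary}\n\n"
        f"Rate the summary on '{aspect}' ({ASPECTS[aspect]}) "
        f"from 1 (very poor) to 5 (excellent). Reply with the number only."
    )
```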
Benchmarking Results: Current Limitations of MLLM Evaluators
The research benchmarked several state-of-the-art multimodal assessment methods on MDSEval, including MLLM-as-a-Judge, Image-to-Prompt, and LLaVA-Critic. The findings revealed significant limitations:
- Weak Alignment with Human Judgments: Current MLLM-based evaluators consistently showed a weak correlation with human preferences. They struggled to differentiate between summaries generated by advanced MLLMs.
- Score Concentration Bias: A primary issue identified was a systematic bias where evaluators tended to “hedge” their assessments, producing scores within a very limited range, often concentrating around a score of 4. This lack of variance makes it hard for them to distinguish nuanced quality differences.
- Ineffectiveness of Image-Prompting for Visual Coverage: Methods that translate images into textual descriptions for MLLMs (like Image-to-Prompt) were particularly poor at assessing visual information coverage, likely due to information loss during this translation.
- Positional Bias: In pairwise comparisons, some MLLMs showed a preference for either the first or second option presented, regardless of quality.
Overall, the results suggest that while MLLMs are powerful, current methods for using them as evaluators still struggle to provide human-aligned judgments when assessing summaries from other advanced MLLMs.
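To see what meta-evaluation on MDSEval amounts to in practice, the sketch below computes the kind of statistics behind these findings: rank correlation with human judgments, score spread (which surfaces concentration bias), and an order-swap check for positional bias. The function names and the exact statistics are illustrative, not the benchmark's precise protocol.

```python
# Minimal meta-evaluation sketch for an automatic summary evaluator.

import numpy as np
from scipy.stats import spearmanr

def human_alignment(auto_scores, human_scores) -> float:
    """Spearman rank correlation between evaluator scores and human
    judgments across candidate summaries (higher is better)."""
    rho, _ = spearmanr(auto_scores, human_scores)
    return rho

def score_concentration(auto_scores) -> float:
    """Standard deviation of the evaluator's scores; a small value signals
    'hedging', i.e. most scores clustering in a narrow band such as around 4."""
    return float(np.std(auto_scores))

def positional_bias(first_wins_order_ab: float, first_wins_order_ba: float) -> float:
    """For pairwise judging, compare how often the first-presented summary wins
    before and after swapping the order; a large gap indicates positional bias."""
    return abs(first_wins_order_ab - first_wins_order_ba)
```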
Conclusion and Future Directions
MDSEval represents a crucial step forward in the field of multimodal dialogue summarization by providing the first meta-evaluation benchmark with detailed human annotations. It introduces novel concepts like MEKI to ensure genuine multimodal understanding is required for summarization. The benchmark has highlighted significant biases and limitations in existing MLLM-based evaluation methods, paving the way for the development of more robust and human-aligned assessment techniques.
Future work could expand MDSEval to include more diverse dialogue scenarios, such as customer service or workplace conversations, and incorporate richer modalities like video and audio to make the benchmark even more comprehensive and realistic. You can find the full research paper here: MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization.


