TLDR: MDSEval is the first meta-evaluation benchmark for Multimodal Dialogue Summarization (MDS), featuring image-sharing dialogues, human-annotated summaries across eight quality aspects, and a novel MEKI filtering framework. Benchmarking reveals that current MLLM-based evaluation methods struggle to align with human judgments due to biases like score concentration and positional preferences, highlighting the need for more robust assessment techniques.
Human communication is naturally multimodal, involving text, images, videos, and audio. This has led to the rise of Multimodal Large Language Models (MLLMs), which combine information from different sources to create more natural and effective interactions. A key application of these models is Multimodal Dialogue Summarization (MDS), a task that aims to condense important information from conversations that include various forms of media, such as image-sharing chats.
Developing effective MDS models requires reliable automatic evaluation methods to speed up development and reduce the need for manual assessment. However, these automatic evaluators must themselves be validated against a benchmark grounded in human judgments, a so-called meta-evaluation benchmark. Until now, no such benchmark existed for MDS.
To fill this gap, researchers have introduced MDSEval, the first meta-evaluation benchmark specifically designed for Multimodal Dialogue Summarization. MDSEval provides a comprehensive dataset that includes image-sharing dialogues, several candidate summaries for each dialogue, and detailed human evaluations across eight distinct quality aspects. This benchmark allows for systematic comparisons of different evaluation methods, highlights their weaknesses, and offers valuable insights for creating more accurate and human-aligned assessment techniques for multimodal summarization.
How MDSEval Was Created
The creation of MDSEval involved a careful multi-stage process. The benchmark contains 198 high-quality image-sharing dialogues selected from existing datasets such as PhotoChat and DialogCC. To ensure the dialogues were suitable and challenging enough for summarization, in the sense that a good summary must draw on information from both text and images, the authors introduced a new data filtering framework built around a concept called Mutually Exclusive Key Information (MEKI).
MEKI is designed to identify information that is uniquely conveyed by one modality (either text or image) and cannot be easily guessed from the other. This emphasizes the need for true multimodal understanding in summarization. The research found that MEKI scores strongly correlate with human judgments, indicating its effectiveness in identifying complex multimodal dialogues.
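The article does not spell out how MEKI is computed, but one plausible reading is a projection-residual score over embeddings of key-information units extracted from each modality: information in modality A that cannot be reconstructed from modality B counts as exclusive to A. The sketch below is only an illustration under that assumption; the extraction step, the scoring heuristic, the function names, and the 0.3 threshold are made up for this example and are not the paper's exact formulation.

```python
import numpy as np

def exclusive_information(key_vecs_a: np.ndarray, key_vecs_b: np.ndarray) -> float:
    """Rough MEKI-style score: how much of modality A's key information
    cannot be linearly reconstructed from modality B's key information.

    key_vecs_a: (n_a, d) embeddings of key-information units from modality A
    key_vecs_b: (n_b, d) embeddings of key-information units from modality B
    """
    # Orthonormal basis for the span of modality B's key-information embeddings.
    q, _ = np.linalg.qr(key_vecs_b.T)              # columns of q are orthonormal
    residual_ratios = []
    for v in key_vecs_a:
        proj = q @ (q.T @ v)                       # component of v explained by B
        residual = v - proj                        # component unique to A
        residual_ratios.append(np.linalg.norm(residual) / (np.linalg.norm(v) + 1e-8))
    return float(np.mean(residual_ratios))         # higher -> more A-exclusive info

def passes_meki_filter(text_vecs, image_vecs, threshold=0.3):
    """Keep a dialogue only if each modality carries enough information
    that the other modality cannot supply (illustrative threshold)."""
    return (exclusive_information(text_vecs, image_vecs) >= threshold and
            exclusive_information(image_vecs, text_vecs) >= threshold)
```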
For each image-sharing dialogue, five summaries were generated using various state-of-the-art MLLMs and different prompting strategies. This was done to create a diverse range of summary qualities for evaluation.
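As a rough illustration of this generation step, the snippet below varies both the model and the prompt to obtain candidate summaries of differing quality. `call_mllm`, the prompt wordings, and the loop structure are hypothetical placeholders, not the actual setup used by the MDSEval authors.

```python
# Hypothetical sketch: produce diverse candidate summaries per dialogue
# by combining different MLLMs with different prompting strategies.

PROMPT_VARIANTS = [
    "Summarize this image-sharing conversation in 3-4 sentences.",
    "Write a concise summary that explicitly describes what each shared image shows.",
    "Summarize the dialogue, preserving topic order and linking images to the turns they appear in.",
]

def generate_candidates(dialogue_turns, images, models, call_mllm):
    """call_mllm stands in for any multimodal chat API taking text, images, and an instruction."""
    candidates = []
    for model in models:
        for prompt in PROMPT_VARIANTS:
            summary = call_mllm(
                model=model,
                text="\n".join(dialogue_turns),
                images=images,
                instruction=prompt,
            )
            candidates.append({"model": model, "prompt": prompt, "summary": summary})
    return candidates
```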
Understanding Summary Quality: The Eight Evaluation Aspects
To thoroughly assess the quality of summaries, MDSEval defines eight specific evaluation aspects tailored for the MDS task. These aspects focus on capturing cross-modal understanding and overall summary quality:
- Multimodal Coherence: How naturally the summary integrates information from both images and text.
- Conciseness: How efficiently the summary conveys essential information without being overly wordy.
- Multimodal Coverage (Visual, Textual, and Overall): The extent to which the summary captures key information from visual elements, textual dialogue, and both combined.
- Multimodal Information Balancing: How well the summary balances information from different modalities, avoiding overemphasis on one.
- Topic Progression: How accurately the summary captures the flow of topics and associates images with relevant parts of the dialogue.
- Multimodal Faithfulness: Evaluated at the sentence level, this assesses whether the summary accurately reflects the original dialogue and images without introducing incorrect or fabricated information.
These aspects were meticulously annotated by experienced human experts, with strong agreement among annotators, ensuring the reliability of the benchmark.
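To make the aspects concrete, here is one way they could be encoded as a scoring rubric and turned into a single-aspect judging prompt. The aspect descriptions are paraphrased from the list above, and the 1-to-5 scale is an assumption (the score-concentration finding discussed below suggests a 1-5 Likert scale); the exact annotation guidelines are in the paper.

```python
# Illustrative encoding of the eight MDSEval aspects as a scoring rubric.

ASPECTS = {
    "multimodal_coherence":    "How naturally the summary integrates image and text information.",
    "conciseness":             "How efficiently the summary conveys essential information.",
    "coverage_visual":         "Coverage of key information from the shared images.",
    "coverage_textual":        "Coverage of key information from the textual dialogue.",
    "coverage_overall":        "Coverage of key information from both modalities combined.",
    "information_balancing":   "Balance between modalities, without overemphasizing either one.",
    "topic_progression":       "Whether topic flow and image-to-turn associations are preserved.",
    "multimodal_faithfulness": "Absence of content contradicting the dialogue or the images.",
}

def build_judge_prompt(dialogue: str, summary: str, aspect: str) -> str:
    """Build a single-aspect scoring prompt for an MLLM-as-a-judge setup (assumed 1-5 scale)."""
    return (
        f"Dialogue:\n{dialogue}\n\nSummary:\n{summary}\n\n"
        f"Rate the summary on '{aspect}' ({ASPECTS[aspect]}) "
        f"from 1 (very poor) to 5 (excellent). Reply with the number only."
    )
```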
Benchmarking Results: Current Limitations of MLLM Evaluators
The research benchmarked several state-of-the-art multimodal assessment methods on MDSEval, including MLLM-as-a-Judge, Image-to-Prompt, and LLaVA-Critic. The findings revealed significant limitations:
- Weak Alignment with Human Judgments: Current MLLM-based evaluators consistently showed a weak correlation with human preferences. They struggled to differentiate between summaries generated by advanced MLLMs.
- Score Concentration Bias: A primary issue identified was a systematic bias where evaluators tended to “hedge” their assessments, producing scores within a very limited range, often concentrating around a score of 4. This lack of variance makes it hard for them to distinguish nuanced quality differences.
- Ineffectiveness of Image-Prompting for Visual Coverage: Methods that translate images into textual descriptions for MLLMs (like Image-to-Prompt) were particularly poor at assessing visual information coverage, likely due to information loss during this translation.
- Positional Bias: In pairwise comparisons, some MLLMs showed a preference for either the first or second option presented, regardless of quality.
Overall, the results suggest that while MLLMs are powerful, current methods for using them as evaluators still struggle to provide human-aligned judgments when assessing summaries from other advanced MLLMs.
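To see what meta-evaluation on MDSEval amounts to in practice, the sketch below computes the kind of statistics behind these findings: rank correlation with human judgments, score spread (which surfaces concentration bias), and an order-swap check for positional bias. The function names and the exact statistics are illustrative, not the benchmark's precise protocol.

```python
# Minimal meta-evaluation sketch for an automatic summary evaluator.

import numpy as np
from scipy.stats import spearmanr

def human_alignment(auto_scores, human_scores) -> float:
    """Spearman rank correlation between evaluator scores and human
    judgments across candidate summaries (higher is better)."""
    rho, _ = spearmanr(auto_scores, human_scores)
    return rho

def score_concentration(auto_scores) -> float:
    """Standard deviation of the evaluator's scores; a small value signals
    'hedging', i.e. most scores clustering in a narrow band such as around 4."""
    return float(np.std(auto_scores))

def positional_bias(first_wins_order_ab: float, first_wins_order_ba: float) -> float:
    """For pairwise judging, compare how often the first-presented summary wins
    before and after swapping the order; a large gap indicates positional bias."""
    return abs(first_wins_order_ab - first_wins_order_ba)
```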
Conclusion and Future Directions
MDSEval represents a crucial step forward in the field of multimodal dialogue summarization by providing the first meta-evaluation benchmark with detailed human annotations. It introduces novel concepts like MEKI to ensure genuine multimodal understanding is required for summarization. The benchmark has highlighted significant biases and limitations in existing MLLM-based evaluation methods, paving the way for the development of more robust and human-aligned assessment techniques.
Future work could expand MDSEval to include more diverse dialogue scenarios, such as customer service or workplace conversations, and incorporate richer modalities like video and audio to make the benchmark even more comprehensive and realistic. You can find the full research paper here: MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization.


