New Benchmark Reveals Modality Imbalance in AI Understanding

TLDR: The SEAM benchmark evaluates Vision-Language Models (VLMs) by testing their ability to reason consistently across semantically equivalent visual and textual inputs in four domains: chess, chemistry, music, and graph theory. It found that VLMs exhibit systematic modality imbalance, with vision often performing worse than language, and low cross-modal agreement. Error analysis points to textual tokenization failures and visual perception failures (hallucinations) as main drivers. SEAM provides a controlled framework to measure and improve modality-agnostic reasoning in VLMs.

A new research paper introduces SEAM, a benchmark designed to rigorously evaluate how consistently Vision-Language Models (VLMs) reason when the same information is presented in two formats, visual and textual. The benchmark aims to uncover whether these models understand concepts in a unified way, or whether their performance depends heavily on the modality of the input.

The Challenge of Multimodal Reasoning

Vision-Language Models have made significant strides in processing and generating content that combines images and text. However, assessing whether they reason consistently across these different representations has been a major hurdle. Traditional comparisons often confound task differences with modality differences, making it hard to tell whether performance gaps stem from genuine reasoning issues or simply from varying task difficulty. Existing benchmarks have either lacked precise cross-modal alignment or introduced biases, leaving a gap in how we measure true modality-agnostic reasoning.

Introducing SEAM: A New Standard for Evaluation

The SEAM benchmark, short for Semantically Equivalent Across Modalities, tackles this problem head-on. It pairs semantically identical inputs across four distinct domains that have established textual and visual notation systems: chess, chemistry, music, and graph theory. Unlike benchmarks that simply convert text into images (like OCR-based methods), SEAM uses fundamentally different notation systems for each modality. For example, in chess, it compares a visual chessboard with its textual Forsyth-Edwards Notation (FEN) string. In chemistry, it uses structural diagrams versus SMILES strings. Music is represented by sheet music and ABC notation, and graph theory by node-edge diagrams and adjacency matrices.

This unique approach ensures that the information content is precisely the same, allowing researchers to isolate and measure how well VLMs perform when only the representation changes. Each task within SEAM is self-contained within a single modality, preventing confounding factors from joint inference and enabling clear evaluations for language-only, vision-only, and combined language-vision scenarios. The benchmark includes 16 tasks, with 200 items per task, totaling 3,200 multiple-choice questions designed with carefully crafted distractor answers to calibrate difficulty.
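To make the idea of semantic equivalence concrete, here is a minimal Python sketch showing that a FEN string contains everything needed to redraw the board a model would otherwise see as an image. The function name `fen_to_board` and the use of `.` for empty squares are illustrative assumptions, not part of the SEAM code.

```python
def fen_to_board(fen: str) -> list[list[str]]:
    """Expand the piece-placement field of a FEN string into an 8x8 grid."""
    placement = fen.split()[0]           # first field holds the piece placement
    board = []
    for rank in placement.split("/"):    # ranks are listed from rank 8 down to rank 1
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend("." * int(ch))  # a digit is a run of empty squares
            else:
                row.append(ch)             # a letter is a piece (uppercase = White)
        if len(row) != 8:
            raise ValueError(f"rank {rank!r} does not describe 8 squares")
        board.append(row)
    if len(board) != 8:
        raise ValueError("FEN must describe 8 ranks")
    return board


# Standard starting position
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = fen_to_board(START_FEN)
for row in board:
    print(" ".join(row))
```

The printed grid is the same position a rendered chessboard image would show, which is the property that lets a benchmark change the representation while holding the content fixed.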

Key Findings: Modality Imbalance and Low Agreement

The evaluation of 21 state-of-the-art VLMs using SEAM revealed a systematic modality imbalance. Across the board, models showed significant performance gaps between vision and language inputs. Vision frequently lagged behind language in overall accuracy, even though the problems contained semantically equivalent information. Furthermore, the agreement between answers generated from cross-modal inputs was surprisingly low, often not much better than random chance. This suggests that current models process information very differently across modalities and have considerable room to improve in integrating their reasoning abilities.

The imbalance also varied significantly by domain. In chess and chemistry, models sometimes performed comparably or even slightly better with vision inputs. However, in music, language inputs generally yielded superior results, and this gap widened considerably for graph-related tasks.
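The two quantities behind these findings, per-modality accuracy and cross-modal agreement, can be sketched in a few lines of Python. The answer lists below are made-up toy data, and the four-option chance baseline is an assumption for illustration; the article does not state how many options each question has or exactly how the paper computes agreement.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def agreement(preds_a, preds_b):
    """Fraction of items on which two runs give the same answer, right or wrong."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy data: the same 8 questions answered from text and from images (hypothetical)
gold           = ["A", "C", "B", "D", "A", "B", "C", "D"]
language_preds = ["A", "C", "B", "D", "A", "B", "D", "A"]
vision_preds   = ["A", "B", "B", "A", "C", "B", "C", "D"]

lang_acc = accuracy(language_preds, gold)              # 0.75
vis_acc = accuracy(vision_preds, gold)                 # 0.625
cross_modal = agreement(language_preds, vision_preds)  # 0.375

# Assumed baseline: two independent uniform guessers over k options agree 1/k of the time
k = 4
chance_agreement = 1 / k

print(f"language accuracy:     {lang_acc:.3f}")
print(f"vision accuracy:       {vis_acc:.3f}")
print(f"cross-modal agreement: {cross_modal:.3f} (chance = {chance_agreement:.3f})")
```

Note that agreement is measured independently of correctness: in the toy data both modalities score well above the chance baseline on accuracy, yet they give the same answer on only 3 of 8 items.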

Understanding the Errors: Perception Failures

The research identified two primary drivers for these performance discrepancies:

Textual Perception Failures: Many open-source models struggled with tokenization, especially in specialized domain notations like SMILES strings in chemistry or FEN notation in chess. Incorrectly segmenting these strings into meaningless subwords led to fundamental misinterpretations of the information.
Visual Perception Failures: The vision modality also showed limitations, often failing to compensate for textual difficulties. In graph theory tasks, for instance, models exhibited severe hallucinations, incorrectly inferring edges or nodes, particularly when image patches were cut near intersections. This suggests that the process of breaking down images into patches for visual transformers can be problematic.
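The tokenization problem can be illustrated with a small Python sketch. It contrasts a chemistry-aware, regex-based split of a SMILES string with a naive fixed-width split. The regex is a simplified version of the atom-level patterns commonly used in chemistry machine learning, and `naive_chunks` is a hypothetical stand-in for a poorly matched subword vocabulary; neither is taken from the paper or from any specific model's tokenizer.

```python
import re

# Simplified atom-level SMILES pattern (an assumption; it does not cover every SMILES feature)
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter elements must be tried before single letters
    r"|[BCNOPSFI]"           # one-letter organic-subset elements
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#$:/\\\-().+@~*]"   # bonds, branches, and other symbols
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def naive_chunks(text: str, size: int = 2) -> list[str]:
    """Hypothetical stand-in for a mismatched subword split: fixed-width pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chloroform = "ClC(Cl)Cl"
print(tokenize_smiles(chloroform))   # ['Cl', 'C', '(', 'Cl', ')', 'Cl']
print(naive_chunks(chloroform))      # ['Cl', 'C(', 'Cl', ')C', 'l']
```

In the naive split, the final chlorine is broken into `)C` and `l`, so the fragment reads as a carbon followed by a stray character. This is the kind of segmentation error described above, shown here on a deliberately simple molecule.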

Implications for Future AI Development

The SEAM benchmark highlights a fundamental limitation in current VLMs: their struggle to reason consistently across semantically equivalent visual and textual representations. This gap indicates that despite impressive advancements, AI models are not yet truly modality-agnostic. The findings provide actionable insights for future research, emphasizing the need for better task-specific tokenizers and domain-specific VLM training. The researchers have publicly released the code, dataset, and a leaderboard to encourage further development in this critical area.
