TLDR: A new research paper introduces a four-metric suite (grounding efficiency, content alignment, lexical adaptation, human-likeness) to evaluate how Vision Language Models (VLMs) establish common ground in interactive dialogues, rather than just their task success. The study, using the PhotoBook game with GPT-4.1, GPT-4o-mini, and Claude 3.5 Haiku, found that VLMs significantly diverge from human communication patterns, with GPT-4o-mini being the closest. Key findings include that task success and image-utterance alignment do not guarantee successful grounding, and that VLMs exhibit “sycophantic” behaviors. The research emphasizes the need for training methods that foster collaborative, incremental dialogue for more human-like AI.
Large Vision Language Models (VLMs) are becoming increasingly sophisticated and are often credited with advanced reasoning skills. However, a new research paper from Northeastern University highlights a critical gap in how these models are typically evaluated. Current benchmarks often focus on single-turn interactions or simple question answering, a setup that doesn’t capture the complex, interactive process of building “common ground” that humans engage in during communication.
Common ground refers to the shared understanding that people gradually develop through ongoing dialogue. To truly build collaborative AI systems, models need to establish this shared understanding efficiently, much like humans do. This involves adapting vocabulary, being concise, and understanding when mutual understanding has been achieved.
The researchers, Saki Imai, Mert Inan, Anthony Sicilia, and Malihe Alikhani, introduce a comprehensive four-metric suite to systematically evaluate VLM performance in these interactive grounding contexts. These metrics are:
Grounding Efficiency
This measures how efficiently VLM pairs reach common ground compared to humans. It looks at task success (correctly identifying shared images), word count (total words produced), and turn count (number of conversational turns). The study found that humans achieved the highest task success with fewer words but significantly more turns, indicating a more incremental, refined communication style. VLMs, in contrast, used more words and fewer turns, suggesting less efficient communication.
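To make the bookkeeping concrete, here is a minimal sketch of how these three quantities could be computed from dialogue transcripts. It assumes each dialogue is stored as a list of (speaker, utterance) turns plus a success flag; the `Dialogue` structure and field names are illustrative, not the paper’s actual code.

```python
# Illustrative grounding-efficiency bookkeeping; the Dialogue structure
# and field names are hypothetical, not taken from the paper's code.
from dataclasses import dataclass

@dataclass
class Dialogue:
    turns: list[tuple[str, str]]  # (speaker, utterance) pairs
    success: bool                 # did the pair correctly identify the shared images?

def efficiency_stats(dialogues: list[Dialogue]) -> dict[str, float]:
    """Average task success, word count, and turn count over a set of dialogues."""
    n = len(dialogues)
    return {
        "task_success_rate": sum(d.success for d in dialogues) / n,
        "avg_words_per_dialogue": sum(len(u.split()) for d in dialogues
                                      for _, u in d.turns) / n,
        "avg_turns_per_dialogue": sum(len(d.turns) for d in dialogues) / n,
    }
```

Under this framing, the human pattern reported above shows up as a high success rate paired with a low word average but a high turn average.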
Content Alignment
This metric assesses how closely VLM utterances align with the visual referents. It uses CLIPScore, in both absolute and contrastive forms, to check whether models describe diagnostic features that uniquely identify a target image among distractors. Interestingly, the research found that high image-utterance alignment (a high CLIPScore) does not necessarily predict task success. Humans, for example, achieved near-perfect task scores despite lower alignment scores, implying they simplify descriptions as mutual knowledge grows, a form of pragmatic reasoning that VLMs often miss.
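As a rough illustration, the snippet below computes an absolute CLIPScore (following the common Hessel et al., 2021 definition, 2.5 · max(cos, 0)) and a simple contrastive variant (target similarity minus the best distractor similarity) using the public openai/clip-vit-base-patch32 checkpoint. The contrastive formulation here is one plausible reading, not necessarily the paper’s exact metric.

```python
# Hedged sketch of absolute and contrastive CLIPScore for one utterance.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(utterance: str, target: Image.Image, distractors: list[Image.Image]):
    inputs = processor(text=[utterance], images=[target] + distractors,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    sims = torch.cosine_similarity(img, txt)  # one similarity per image
    absolute = 2.5 * torch.clamp(sims[0], min=0).item()
    # Positive only if the utterance fits the target better than every distractor.
    contrastive = (sims[0] - sims[1:].max()).item()
    return absolute, contrastive
```

The human result above then reads naturally: as shared context accumulates, utterances get shorter and less image-specific, so the absolute score drops even while the pair keeps identifying targets correctly.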
Lexical Adaptation
This evaluates whether VLM pairs form human-like “conceptual pacts,” meaning they reuse each other’s terms and prune redundant details over time, quantified with the Word Novelty Rate (WNR). Humans showed the steepest decline in WNR, indicating strong lexical stabilization. Claude 3.5 Haiku showed moderate adaptation, while the GPT models struggled more to stabilize and reuse previously grounded expressions.
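A simple version of WNR can be computed as the fraction of word tokens in each game round that never appeared in an earlier round, as sketched below; the exact tokenization and normalization the authors use may differ.

```python
# One natural reading of Word Novelty Rate: per round, the share of word
# tokens not seen in any earlier round. Tokenization here is a simple
# lowercase word match, which may differ from the paper's preprocessing.
import re

def word_novelty_rates(rounds: list[str]) -> list[float]:
    seen: set[str] = set()
    rates = []
    for text in rounds:
        tokens = re.findall(r"[a-z']+", text.lower())
        novel = [t for t in tokens if t not in seen]
        rates.append(len(novel) / len(tokens) if tokens else 0.0)
        seen.update(tokens)
    return rates
```

A steep decline in this rate across rounds is exactly the lexical stabilization described above: pairs converge on shared shorthand instead of coining new descriptions each round.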
Human-likeness
This metric uses Discrete Energy Distance to gauge how human-like VLM utterances are at a distributional level, capturing whether the overall distribution of VLM dialogues resembles human interactions. GPT-4o-mini was found to be the most human-like overall, with its utterance distribution closest to the human data, while Claude 3.5 Haiku and GPT-4.1 diverged more stylistically.
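For intuition, the standard two-sample energy statistic, E(X, Y) = 2·E‖X − Y‖ − E‖X − X′‖ − E‖Y − Y′‖, can be computed over utterance embeddings as below; the paper’s discrete variant may use a different underlying distance, so treat this as an illustrative stand-in.

```python
# Sketch of a two-sample energy distance over utterance embeddings.
# This assumes utterances have already been embedded (e.g., with any
# sentence encoder); it is not the paper's exact discrete formulation.
import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """x: (n, d) embeddings of VLM utterances; y: (m, d) human utterances."""
    def mean_pdist(a, b):
        diffs = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()
    # Includes the zero self-distance diagonal, a slight bias that is
    # acceptable for illustration.
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)
```

Lower values indicate that the two utterance distributions are closer, which is how GPT-4o-mini’s proximity to the human data would register under this statistic.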
The study deployed this suite on 150 self-play sessions of interactive referential games based on the PhotoBook task, comparing three proprietary VLMs (GPT-4.1, GPT-4o-mini, and Claude 3.5 Haiku) with human dyads. A key finding was that all three models diverged from human patterns on at least three of the four metrics, with GPT-4o-mini the closest overall. The results also reinforce that task success scores alone do not indicate successful grounding, and that high image-utterance alignment does not necessarily predict task success.
A notable observation was “sycophantic” behavior in VLMs: models sometimes revise their guesses to match their partner’s revealed responses. When the ground-truth labels coincidentally match, this inflates scores and creates a false impression of successful grounding. Prompt engineering was shown to mitigate the effect to some extent.
The researchers attribute these divergences to several factors: a mismatch in training data (VLMs are trained on single image captions, not multi-round dialogues), reward alignment bias (RLHF often rewards “agreeable” responses, leading to mirroring), and the effortless nature of token generation for VLMs (unlike humans, they face no cognitive cost for verbosity). This leads to unnecessarily long utterances and a lack of established shorthand.
This research provides a crucial framework for future work on VLM grounding, emphasizing the need for training methods that encourage incremental, collaborative dialogue over isolated, verbose responses. It highlights that focusing on the “how” of VLM communication, not just the “whether” of task completion, is essential for developing truly collaborative AI systems. You can read the full paper here: Measuring How (Not Just Whether) VLMs Build Common Ground.