TLDR: A new research paper introduces a four-metric suite (grounding efficiency, content alignment, lexical adaptation, human-likeness) to evaluate how Vision Language Models (VLMs) establish common ground in interactive dialogues, rather than just their task success. The study, using the PhotoBook game with GPT-4.1, GPT-4o-mini, and Claude 3.5 Haiku, found that VLMs significantly diverge from human communication patterns, with GPT-4o-mini being the closest. Key findings include that task success and image-utterance alignment do not guarantee successful grounding, and that VLMs exhibit “sycophantic” behaviors. The research emphasizes the need for training methods that foster collaborative, incremental dialogue for more human-like AI.
Large Vision Language Models (VLMs) are becoming increasingly sophisticated and are often credited with advanced reasoning skills. However, a new research paper from Northeastern University highlights a critical gap in how these models are typically evaluated. Current benchmarks often focus on single-turn interactions or simple question answering, a setup that doesn’t capture the complex, interactive process of building “common ground” that humans engage in during communication.
Common ground refers to the shared understanding that people gradually develop through ongoing dialogue. To truly build collaborative AI systems, models need to establish this shared understanding efficiently, much like humans do. This involves adapting vocabulary, being concise, and understanding when mutual understanding has been achieved.
The researchers, Saki Imai, Mert Inan, Anthony Sicilia, and Malihe Alikhani, introduce a comprehensive four-metric suite to systematically evaluate VLM performance in these interactive grounding contexts. These metrics are:
Grounding Efficiency
This measures how efficiently VLM pairs reach common ground compared to humans. It looks at task success (correctly identifying shared images), word count (total words produced), and turn count (number of conversational turns). The study found that humans achieved the highest task success with fewer words but significantly more turns, indicating a more incremental, refined communication style. VLMs, in contrast, used more words and fewer turns, suggesting less efficient communication.
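To make the bookkeeping concrete, here is a minimal sketch of how these three quantities could be computed from dialogue transcripts. It assumes each dialogue is stored as a list of (speaker, utterance) turns plus a success flag; the `Dialogue` structure and field names are illustrative, not the paper’s actual code.

```python
# Illustrative grounding-efficiency bookkeeping; the Dialogue structure
# and field names are hypothetical, not taken from the paper's code.
from dataclasses import dataclass

@dataclass
class Dialogue:
    turns: list[tuple[str, str]]  # (speaker, utterance) pairs
    success: bool                 # did the pair correctly identify the shared images?

def efficiency_stats(dialogues: list[Dialogue]) -> dict[str, float]:
    """Average task success, word count, and turn count over a set of dialogues."""
    n = len(dialogues)
    return {
        "task_success_rate": sum(d.success for d in dialogues) / n,
        "avg_words_per_dialogue": sum(len(u.split()) for d in dialogues
                                      for _, u in d.turns) / n,
        "avg_turns_per_dialogue": sum(len(d.turns) for d in dialogues) / n,
    }
```

Under this framing, the human pattern reported above shows up as a high success rate paired with a low word average but a high turn average.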
Content Alignment
This metric assesses how closely VLM utterances align with the visual referents. It uses CLIPScore, in both absolute and contrastive forms, to check whether models describe diagnostic features that uniquely identify a target image among distractors. Interestingly, the research found that high image-utterance alignment (a high CLIPScore) does not necessarily predict task success. Humans, for example, achieved near-perfect task scores despite lower alignment scores, implying they simplify descriptions as mutual knowledge grows, a form of pragmatic reasoning that VLMs often miss.
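As a rough illustration, the snippet below computes an absolute CLIPScore (following the common Hessel et al., 2021 definition, 2.5 · max(cos, 0)) and a simple contrastive variant (target similarity minus the best distractor similarity) using the public openai/clip-vit-base-patch32 checkpoint. The contrastive formulation here is one plausible reading, not necessarily the paper’s exact metric.

```python
# Hedged sketch of absolute and contrastive CLIPScore for one utterance.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(utterance: str, target: Image.Image, distractors: list[Image.Image]):
    inputs = processor(text=[utterance], images=[target] + distractors,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    sims = torch.cosine_similarity(img, txt)  # one similarity per image
    absolute = 2.5 * torch.clamp(sims[0], min=0).item()
    # Positive only if the utterance fits the target better than every distractor.
    contrastive = (sims[0] - sims[1:].max()).item()
    return absolute, contrastive
```

The human result above then reads naturally: as shared context accumulates, utterances get shorter and less image-specific, so the absolute score drops even while the pair keeps identifying targets correctly.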
Lexical Adaptation
This evaluates whether VLM pairs form human-like “conceptual pacts,” meaning they reuse each other’s terms and prune redundant details over time, quantified with the Word Novelty Rate (WNR). Humans showed the steepest decline in WNR, indicating strong lexical stabilization. Claude 3.5 Haiku showed moderate adaptation, while the GPT models struggled more to stabilize and reuse previously grounded expressions.
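A simple version of WNR can be computed as the fraction of word tokens in each game round that never appeared in an earlier round, as sketched below; the exact tokenization and normalization the authors use may differ.

```python
# One natural reading of Word Novelty Rate: per round, the share of word
# tokens not seen in any earlier round. Tokenization here is a simple
# lowercase word match, which may differ from the paper's preprocessing.
import re

def word_novelty_rates(rounds: list[str]) -> list[float]:
    seen: set[str] = set()
    rates = []
    for text in rounds:
        tokens = re.findall(r"[a-z']+", text.lower())
        novel = [t for t in tokens if t not in seen]
        rates.append(len(novel) / len(tokens) if tokens else 0.0)
        seen.update(tokens)
    return rates
```

A steep decline in this rate across rounds is exactly the lexical stabilization described above: pairs converge on shared shorthand instead of coining new descriptions each round.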
Human-likeness
This metric uses Discrete Energy Distance to gauge how human-like VLM utterances are at a distributional level, capturing whether the overall distribution of VLM dialogues resembles human interactions. GPT-4o-mini was found to be the most human-like overall, with its utterance distribution closest to the human data, while Claude 3.5 Haiku and GPT-4.1 diverged more stylistically.
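For intuition, the standard two-sample energy statistic, E(X, Y) = 2·E‖X − Y‖ − E‖X − X′‖ − E‖Y − Y′‖, can be computed over utterance embeddings as below; the paper’s discrete variant may use a different underlying distance, so treat this as an illustrative stand-in.

```python
# Sketch of a two-sample energy distance over utterance embeddings.
# This assumes utterances have already been embedded (e.g., with any
# sentence encoder); it is not the paper's exact discrete formulation.
import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """x: (n, d) embeddings of VLM utterances; y: (m, d) human utterances."""
    def mean_pdist(a, b):
        diffs = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()
    # Includes the zero self-distance diagonal, a slight bias that is
    # acceptable for illustration.
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)
```

Lower values indicate that the two utterance distributions are closer, which is how GPT-4o-mini’s proximity to the human data would register under this statistic.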
The study deployed this suite on 150 self-play sessions of interactive referential games based on the PhotoBook task, comparing three proprietary VLMs (GPT-4.1, GPT-4o-mini, and Claude 3.5 Haiku) with human dyads. A key finding was that all three models diverged from human patterns on at least three of the four metrics, with GPT-4o-mini the closest overall. The results also reinforce that task success scores alone do not indicate successful grounding, and that high image-utterance alignment does not necessarily predict task success.
A notable observation was “sycophantic” behavior in VLMs: models sometimes revise their guesses to match their partner’s revealed responses. When the ground-truth labels coincidentally match, this inflates scores and creates a false impression of successful grounding. Prompt engineering was shown to mitigate the effect to some extent.
The researchers attribute these divergences to several factors: a mismatch in training data (VLMs are trained on single image captions, not multi-round dialogues), reward alignment bias (RLHF often rewards “agreeable” responses, leading to mirroring), and the effortless nature of token generation for VLMs (unlike humans, they face no cognitive cost for verbosity). This leads to unnecessarily long utterances and a lack of established shorthand.
This research provides a crucial framework for future work on VLM grounding, emphasizing the need for training methods that encourage incremental, collaborative dialogue over isolated, verbose responses. It highlights that focusing on the “how” of VLM communication, not just the “whether” of task completion, is essential for developing truly collaborative AI systems. You can read the full paper here: Measuring How (Not Just Whether) VLMs Build Common Ground.