TLDR: A new research paper introduces a vision-language clustering framework to enhance the evaluation of visual activity recognition systems. By grouping verbs into ‘sense clusters’ that account for synonyms and multiple perspectives, the framework provides a more robust and human-aligned assessment of AI model performance, revealing significant accuracy improvements over traditional exact-match methods.
Evaluating how well artificial intelligence systems understand visual activities, like recognizing someone is ‘cooking’ or ‘riding’ in an image, is a complex challenge. Traditional evaluation methods often fall short because they rely on a single ‘gold standard’ answer. This approach doesn’t account for the natural ambiguities in how we describe actions. For instance, ‘brushing’ and ‘grooming’ can refer to the same event, and an image of a marching band could be accurately described as either ‘marching’ or ‘performing’, depending on the perspective. Exact-match evaluations count such valid alternatives as errors, which gives an incomplete picture of an AI model’s true capabilities.
To address this, researchers Louie Hong Yao, Nicholas Jarvis, and Tianyu Jiang have proposed a novel vision-language clustering framework. This framework aims to create ‘verb sense clusters’ that group together verbs with similar meanings or those that describe the same event from different valid perspectives. This provides a more robust and accurate way to evaluate visual activity recognition systems.
How the Clustering Framework Works
The core of their approach is a two-step clustering process. Before any clustering happens, the researchers acquire image-verb pairs using multimodal large language models (LLMs) such as GPT-4o mini and Llama-3.2-90B. These LLMs help generate a comprehensive set of appropriate verbs for each image, mitigating the bias that might come from relying solely on the original dataset labels.
The first clustering step, called ‘Same-Verb Clustering,’ focuses on disambiguating the fine-grained senses of individual verbs. For each verb, all associated images are collected, and their visual and semantic properties are transformed into high-dimensional embeddings. These embeddings are then clustered to group together instances where the same verb is used in different contexts or senses.
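To make the idea concrete, here is a minimal sketch of grouping one verb's images into senses. The article does not say which clustering algorithm or embedding model the authors use, so the greedy cosine-similarity grouping, the `threshold` value, and the toy 2-D vectors below are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def same_verb_clusters(embeddings, threshold=0.8):
    """Greedily group the embeddings of one verb's images into sense clusters.

    Each embedding joins the first cluster whose centroid is at least
    `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `embeddings`.
    NOTE: this is an illustrative stand-in, not the paper's algorithm.
    """
    clusters = []   # index lists, one per sense cluster
    centroids = []  # running mean vector per cluster
    for i, emb in enumerate(embeddings):
        for c, centroid in enumerate(centroids):
            if cosine(emb, centroid) >= threshold:
                clusters[c].append(i)
                n = len(clusters[c])
                # Update the running mean with the new member.
                centroids[c] = [(m * (n - 1) + x) / n for m, x in zip(centroid, emb)]
                break
        else:
            clusters.append([i])
            centroids.append(list(emb))
    return clusters

# Hypothetical embeddings for four images labeled "brushing":
# the first two lean toward one sense, the last two toward another.
brushing = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
senses = same_verb_clusters(brushing)
print(senses)
```

In this toy run the four images split into two groups, which is the kind of per-verb sense separation the step describes. A real pipeline would use high-dimensional embeddings and likely a more robust clustering method.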
The second step, ‘Cross-Verb Clustering,’ takes the results from the first step and merges them further. This step addresses ambiguities across different verbs, grouping clusters that represent shared meanings or activities. For example, if ‘teaching’ and ‘lecturing’ are found to describe the same underlying activity in various images, they would be grouped into a single sense cluster.
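The merging step can be sketched in a similar spirit. The example below assumes each same-verb sense cluster is summarized by a centroid vector and that clusters from different verbs are merged when their centroids are close. The union-find merging, the `threshold`, and the sample vectors are hypothetical; the article does not describe the actual merging criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cross_verb_merge(sense_centroids, threshold=0.9):
    """Merge sense clusters of different verbs whose centroids are similar.

    `sense_centroids` maps a (verb, sense_id) key to a centroid vector.
    Returns a sorted list of merged clusters, each a list of keys.
    NOTE: illustrative only; links are created only between different verbs.
    """
    keys = sorted(sense_centroids)
    parent = {k: k for k in keys}

    def find(k):
        # Follow parent links to the representative, compressing the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            different_verbs = a[0] != b[0]
            if different_verbs and cosine(sense_centroids[a], sense_centroids[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())

# Hypothetical centroids: "teaching" and "lecturing" point the same way,
# one sense of "brushing" matches "grooming", the other stands alone.
centroids = {
    ("teaching", 0): [1.0, 0.0],
    ("lecturing", 0): [0.98, 0.1],
    ("brushing", 0): [0.0, 1.0],
    ("brushing", 1): [-1.0, 0.0],
    ("grooming", 0): [0.05, 1.0],
}
merged = cross_verb_merge(centroids)
print(merged)
```

One design caveat: because merging here is transitive, two senses of the same verb could end up in one cluster if both are close to a sense of a third verb. Whether that is desirable depends on the criterion the authors actually use.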
Unveiling Ambiguity and Improving Evaluation
Through their analysis of the imSitu dataset, the researchers found significant ambiguity. On average, each image could be described by 2.8 sense clusters, with each cluster representing a distinct perspective. Furthermore, each cluster contained an average of 1.6 synonyms, highlighting the prevalence of synonymous verbs. They also discovered that over 70% of images belonged to more than one cluster, indicating multiple valid perspectives, and over 50% of verbs appeared in multiple clusters, showing polysemy (a single verb having multiple senses).
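For readers who want to see how figures of this kind can be derived, the sketch below computes the four statistics from a toy assignment of images to clusters and clusters to verbs. It assumes the reported numbers are simple averages and shares; the function name, dictionary layout, and sample data are invented for illustration and are not taken from the paper.

```python
from collections import Counter

def ambiguity_stats(image_clusters, cluster_verbs):
    """Summarize ambiguity in a clustering.

    `image_clusters` maps image id -> set of cluster ids valid for that image.
    `cluster_verbs` maps cluster id -> set of verbs grouped in that cluster.
    """
    n_images = len(image_clusters)
    n_clusters = len(cluster_verbs)
    # How many clusters each verb appears in (more than one suggests polysemy).
    verb_counts = Counter(v for verbs in cluster_verbs.values() for v in verbs)
    return {
        "clusters_per_image": sum(len(c) for c in image_clusters.values()) / n_images,
        "verbs_per_cluster": sum(len(v) for v in cluster_verbs.values()) / n_clusters,
        "multi_cluster_image_share": sum(len(c) > 1 for c in image_clusters.values()) / n_images,
        "multi_cluster_verb_share": sum(n > 1 for n in verb_counts.values()) / len(verb_counts),
    }

# Hypothetical toy data.
image_clusters = {
    "img1": {"c1", "c2"},
    "img2": {"c1"},
    "img3": {"c2", "c3"},
    "img4": {"c3"},
}
cluster_verbs = {
    "c1": {"marching", "parading"},
    "c2": {"performing"},
    "c3": {"performing", "playing"},
}
stats = ambiguity_stats(image_clusters, cluster_verbs)
print(stats)
```

In this toy data, half of the images belong to more than one cluster and one verb out of four ("performing") appears in more than one cluster.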
When evaluating various visual activity recognition models—including supervised models like ResNet and CLIP, and zero-shot multimodal LLMs like GPT-4o mini and Llama—the cluster-based evaluation consistently yielded higher accuracy scores compared to traditional exact-match methods. This improvement was largely attributed to the framework’s ability to account for both synonymous verbs and different perspectives of an image. For instance, LLMs showed substantial gains, with improvements as high as 27% due to addressing perspective-related challenges.
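The gap between the two scoring schemes is easy to illustrate. The sketch below assumes a prediction counts as correct under cluster-based evaluation if the predicted verb belongs to any sense cluster that is valid for the image. That scoring rule, the function names, and the sample data are assumptions for illustration; the paper's exact metric may differ.

```python
def exact_match_accuracy(predictions, gold):
    """Share of images whose predicted verb equals the single gold verb."""
    hits = sum(predictions[img] == gold[img] for img in gold)
    return hits / len(gold)

def cluster_accuracy(predictions, image_clusters, cluster_verbs):
    """Share of images whose predicted verb falls in any valid sense cluster."""
    hits = 0
    for img, cluster_ids in image_clusters.items():
        # Union of all verbs from every cluster that is valid for this image.
        valid = set().union(*(cluster_verbs[c] for c in cluster_ids))
        hits += predictions[img] in valid
    return hits / len(image_clusters)

# Hypothetical toy data: one gold verb per image, plus cluster assignments.
gold = {"img1": "marching", "img2": "brushing", "img3": "teaching", "img4": "cooking"}
predictions = {"img1": "performing", "img2": "grooming", "img3": "teaching", "img4": "eating"}
image_clusters = {
    "img1": ["march", "perform"],  # two valid perspectives
    "img2": ["groom"],
    "img3": ["teach"],
    "img4": ["cook"],
}
cluster_verbs = {
    "march": {"marching", "parading"},
    "perform": {"performing"},
    "groom": {"brushing", "grooming"},  # synonyms for the same event
    "teach": {"teaching", "lecturing"},
    "cook": {"cooking"},
}

exact = exact_match_accuracy(predictions, gold)
clustered = cluster_accuracy(predictions, image_clusters, cluster_verbs)
print(exact, clustered)
```

Here the synonym ("grooming") and the alternative perspective ("performing") are both accepted under cluster-based scoring, while the genuinely wrong prediction ("eating") is still counted as an error.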
Crucially, the study also performed a human alignment analysis, where human judges manually evaluated model predictions. The results showed that the accuracy derived from the cluster-based approach aligned much more closely with human judgments than exact-match evaluations, which tended to underestimate model performance. This suggests that the proposed framework offers a more nuanced and human-aligned assessment of AI models’ understanding of visual activities.
While the framework demonstrates significant promise, the authors acknowledge limitations, such as its reliance on LLM outputs for verb generation and the absence of a gold-standard clustering for direct comparison. However, the approach is designed to be dataset-agnostic, paving the way for its application to other visual activity recognition benchmarks in future work.


