TLDR: The Seeing Culture Benchmark (SCB) is a new dataset and evaluation framework designed to test vision-language models (VLMs) on cultural reasoning and visual grounding, particularly focusing on underrepresented Southeast Asian cultures. It uses a two-stage process of multiple-choice visual question answering and cultural artifact segmentation. Findings show VLMs struggle with culturally similar visual options and exhibit a significant gap between visual reasoning accuracy and spatial grounding ability, highlighting the need for more culturally aware AI.
Multimodal vision-language models (VLMs) have shown great promise in understanding both visual and textual information. However, when it comes to cultural understanding, especially cultural reasoning, existing datasets and benchmarks often fall short. They frequently lack the depth for true cultural reasoning, underrepresent many cultures, and often use AI-generated questions that may not capture authentic cultural nuances.
Introducing the Seeing Culture Benchmark (SCB)
To address these critical gaps, researchers have introduced the Seeing Culture Benchmark (SCB). This novel benchmark is specifically designed to challenge VLMs in cultural reasoning and visual grounding. It focuses on the rich and diverse cultures of seven Southeast Asian countries: Cambodia, Myanmar, Indonesia, Vietnam, the Philippines, Malaysia, and Thailand. These cultures are often overlooked in mainstream datasets, making SCB a crucial step towards more inclusive AI.
The SCB employs a two-stage approach. First, VLMs are presented with multiple-choice visual question answering (VQA) tasks, where they must select the correct visual option from a set of culturally rich images. These visual options are organized into three types: options from the same country as the correct answer, options from different countries, or a mixed group. In every case, all options belong to the same category (e.g., all dances or all musical instruments). Only after answering the VQA question correctly does the model proceed to the second stage.
In the second stage, the VLM is required to segment the relevant cultural artifact within the chosen image, providing visual evidence for its reasoning. This spatial grounding component is vital for confirming that the model truly understands the cultural context, rather than just making a lucky guess.
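The gating between the two stages can be sketched in a few lines of Python. This is only an illustrative outline, not the benchmark's actual harness: the `Item` and `Result` schemas and the `choose` and `ground_iou` callables are hypothetical stand-ins for a model's answer selection and for the scoring of its segmentation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Item:
    """One benchmark item (hypothetical schema, not SCB's actual format)."""
    question: str
    options: List[str]   # identifiers of the candidate images
    answer_index: int    # position of the correct image in `options`


@dataclass
class Result:
    vqa_correct: bool
    iou: Optional[float]  # None when stage two was never reached


def evaluate_item(
    item: Item,
    choose: Callable[[Item], int],
    ground_iou: Callable[[Item], float],
) -> Result:
    """Run stage one, then run stage two only if stage one was correct."""
    picked = choose(item)
    if picked != item.answer_index:
        # Wrong visual option: the segmentation stage is skipped entirely.
        return Result(vqa_correct=False, iou=None)
    # Correct option: ask for a segmentation and record its overlap score.
    return Result(vqa_correct=True, iou=ground_iou(item))
```

Under this sketch, grounding scores are only ever collected for questions the model answered correctly, which matches the gating described above.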
A Rich and Curated Dataset
The SCB dataset is extensive, comprising 1,065 images that showcase 138 distinct cultural artifacts across five main categories: music, game, dance, celebration, and wedding. These images are paired with 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. This human-centric approach ensures that the questions genuinely reflect authentic cultural narratives and avoid potential biases from AI-generated content. The questions are designed to require deeper reasoning, often focusing on the symbols or cultural significance associated with an artifact rather than just its name.
The images themselves are complex, featuring various distracting objects or scenes, sometimes even other cultural artifacts, to truly challenge the models. Segmentation is performed using polygons for fine-grained detail, rather than simpler bounding boxes.
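To see why a polygon can describe an artifact more precisely than a bounding box, consider the small Python sketch below. It compares the area enclosed by a polygon outline (computed with the standard shoelace formula) against the area of the axis-aligned box around the same points. The triangular outline is made up for illustration and is not taken from the dataset.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple (non-self-intersecting) polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box_area(vertices: List[Point]) -> float:
    """Area of the tightest axis-aligned box containing all vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# A made-up triangular outline, e.g. an object lying diagonally in the frame.
outline = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
fill_ratio = polygon_area(outline) / bounding_box_area(outline)
print(fill_ratio)  # 0.5: half of the box is background, not artifact
```

In this toy case the box is half background, so a box-based annotation would credit a model for covering pixels that do not belong to the artifact at all.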
Key Findings and Challenges
Evaluations of various state-of-the-art VLMs on SCB revealed several important insights. Models performed significantly worse when visual options originated from the same country (Type 1 questions) compared to when options came from different countries (Type 2 questions). This suggests that contextual clues related to country or region can heavily influence a VLM’s ability to discern the correct answer. Performance on Type 3 (mixed culture) questions was intermediate.
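A per-type comparison like this amounts to grouping answers by question type before computing accuracy. The Python sketch below shows one way to do that; the type labels and the example records are placeholders invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_type(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each question type to the fraction of its questions answered correctly."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question_type, is_correct in records:
        total[question_type] += 1
        correct[question_type] += int(is_correct)
    return {t: correct[t] / total[t] for t in total}


# Placeholder records (question type, answered correctly?), NOT real SCB results.
example_records = [
    ("type1_same_country", True),
    ("type1_same_country", False),
    ("type2_different_countries", True),
    ("type2_different_countries", True),
]
print(accuracy_by_type(example_records))
```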
A notable disparity was observed between visual reasoning capabilities and spatial grounding. While some VLMs, like GPT-o3, achieved high accuracy in selecting the correct visual option (over 90%), their segmentation quality, measured by mean Intersection over Union (mIoU), was considerably lower (not surpassing 33%). This shows that even when a model identifies the correct cultural context, it often struggles to precisely locate the artifact and substantiate its reasoning visually. The gap is not solely due to limited object segmentation capabilities: grounding by reasoning showed an average drop of 16% in mIoU compared to grounding by simply referring to cultural objects.
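For readers unfamiliar with the metric, Intersection over Union divides the number of pixels shared by the predicted and reference masks by the number of pixels covered by either one, and mIoU averages that ratio over all evaluated examples. The Python sketch below uses sets of pixel coordinates as masks for simplicity. The masks are toy data, and scoring two empty masks as 0.0 is an assumed convention, not something the article specifies.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]


def iou(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Intersection over Union of two masks given as sets of pixel coordinates."""
    union = pred | truth
    if not union:
        return 0.0  # assumed convention when both masks are empty
    return len(pred & truth) / len(union)


def mean_iou(pairs: Iterable[Tuple[Set[Pixel], Set[Pixel]]]) -> float:
    """Average IoU over (prediction, reference) mask pairs."""
    scores = [iou(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy masks: two 2x2 squares that overlap in one column of two pixels.
predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
reference = {(1, 0), (1, 1), (2, 0), (2, 1)}
print(iou(predicted, reference))  # 2 shared pixels / 6 covered pixels = 0.333...
```

The toy example also shows how quickly the score falls: a prediction that covers half of the reference region and is the same size as it reaches an IoU of only one third.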
Furthermore, VLMs generally performed best in the ‘dance’ category, which often features specific dancer characters, and worst in the ‘celebration’ category, which encompasses more intangible cultural concepts.
Guiding Future Developments
The Seeing Culture Benchmark serves as a crucial resource for identifying the shortcomings of current VLMs in cross-modal cultural reasoning and spatial grounding within culturally nuanced scenarios. By highlighting these challenges, SCB aims to guide future developments in the field, fostering the creation of more culturally conscious and capable AI models. For more details, see the full research paper.


