Evaluating Multimodal AI in Persian: Introducing the MEENA Dataset

TLDR: MEENA is the first comprehensive dataset for evaluating Vision-Language Models (VLMs) in Persian, featuring approximately 7,500 Persian and 3,000 English questions across various subjects and educational levels. It addresses the English-centric bias in VLM benchmarks, includes rich metadata, and provides insights into VLM performance on knowledge-based vs. reasoning tasks, hallucination detection, and language-specific challenges.

Recent advancements in artificial intelligence, particularly in Vision-Language Models (VLMs), have opened up new possibilities for how machines understand and interact with the world. These models are designed to interpret both visual information (like images and diagrams) and textual data, enabling them to perform complex tasks such as answering questions about images or generating captions. However, most of this progress has centered on English, leaving a considerable gap for other languages and cultures.

To address this imbalance, a new research paper introduces MEENA, also known as PersianMMMU. It is presented as the first dataset designed specifically to evaluate Vision-Language Models in Persian across a wide array of tasks, including scientific reasoning, general knowledge, and human-level understanding. MEENA is a step towards making AI evaluation more inclusive across diverse linguistic contexts.

What is MEENA?

MEENA stands for Multimodal-Multilingual Educational Exams for N-level Assessment. The name “Mina” itself holds cultural significance in Persian, referring to glass and a traditional art form called Mina-kari, which subtly connects to the dataset’s multimodal nature, including questions related to art. The dataset is comprehensive, comprising approximately 7,500 questions in Persian and 3,000 in English. These questions span a broad spectrum of topics, from core academic subjects like mathematics, physics, and reasoning to more culturally specific areas such as Persian art and literature. It also includes questions involving diagrams and charts, ensuring a holistic evaluation of VLM capabilities.

The dataset is meticulously crafted with several key features:

  • Diverse Subject Coverage: Questions are drawn from various educational levels, ranging from primary school to upper secondary school, ensuring a wide assessment of knowledge and reasoning skills.
  • Rich Metadata: Each question comes with detailed metadata, including difficulty levels, descriptive answers, and indicators for “trap” questions designed to test deeper reasoning. This allows for granular analysis of model performance.
  • Original Persian Data: Unlike many translated datasets, MEENA features original Persian content, preserving crucial cultural nuances that are often lost in direct translations.
  • Bilingual Structure: The inclusion of both Persian and English questions allows for direct comparison and assessment of cross-linguistic performance of VLMs.
  • Extensive Experiments: The benchmark facilitates diverse experiments to evaluate overall performance, a model’s ability to attend to images, and its tendency to generate “hallucinations” (incorrect or irrelevant information).
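To make the metadata features above more concrete, here is a minimal Python sketch of what a single question record might look like and how trap questions or a single language could be filtered out for analysis. The class name, field names, and value labels are all assumptions made for illustration; they are not taken from the published dataset schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MeenaItem:
    """Hypothetical record layout; field names are illustrative only."""
    question: str
    options: List[str]
    answer_index: int              # index of the correct option
    language: str                  # assumed labels: "fa" or "en"
    subject: str                   # e.g. "physics", "art"
    grade_level: str               # e.g. "primary", "upper_secondary"
    difficulty: str                # e.g. "easy", "medium", "hard"
    is_trap: bool                  # flags "trap" questions
    descriptive_answer: Optional[str] = None
    image_path: Optional[str] = None


def select(items, *, language=None, is_trap=None, difficulty=None):
    """Return the items that match every filter that is not None."""
    wanted = {"language": language, "is_trap": is_trap, "difficulty": difficulty}
    active = {key: value for key, value in wanted.items() if value is not None}
    return [
        item for item in items
        if all(getattr(item, key) == value for key, value in active.items())
    ]
```

With records shaped like this, a call such as `select(items, language="fa", is_trap=True)` would isolate the Persian trap questions so their accuracy can be reported separately from the rest.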

How Was MEENA Developed?

The questions in MEENA are primarily sourced from Iran’s 12-year educational framework, specifically from the “Pellekan Yadgiri” (Learning Ladder) platform and a curated selection of questions from Iranian national university entrance exams. The compilation process involved data extraction and cleaning, careful image processing to ensure compatibility with VLMs (e.g., merging multiple images into a single one), and content filtering to maintain relevance and quality. The dataset also includes a bilingual subset, in which Persian questions with images were translated into English. To maintain semantic accuracy, the translation relied on advanced language models like GPT-4o, with an “LLM-as-a-Judge” approach for quality control.
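The translate-then-judge step can be pictured roughly as follows. This is only a sketch under stated assumptions: `translate` and `judge` are hypothetical callables standing in for whatever model calls the authors used, and the 0-to-1 score scale and the `threshold` value are invented for illustration, not reported in the article.

```python
from typing import Callable, Iterable, List, Tuple


def filter_translations(
    questions: Iterable[str],
    translate: Callable[[str], str],       # hypothetical: Persian text -> English text
    judge: Callable[[str, str], float],    # hypothetical: (source, translation) -> score in [0, 1]
    threshold: float = 0.8,                # assumed cut-off, not from the paper
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Translate each question and keep only translations the judge accepts.

    Returns (accepted, rejected): accepted holds (source, translation)
    pairs, rejected holds sources that would need another look.
    """
    accepted: List[Tuple[str, str]] = []
    rejected: List[str] = []
    for source in questions:
        candidate = translate(source)
        score = judge(source, candidate)
        if score >= threshold:
            accepted.append((source, candidate))
        else:
            rejected.append(source)
    return accepted, rejected
```

Passing the two model calls in as functions keeps the sketch independent of any particular API client, and it makes the filter easy to exercise with stub functions before wiring in a real model.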

Key Findings from the Research

The researchers conducted a series of experiments using prominent VLMs such as GPT-4o, GPT-4o-mini, GPT-4-Turbo, Gemini-2.0-flash, and InstructBLIP-T5. These experiments explored various scenarios, including Zero-Shot (minimal guidance), In-Context Learning (with examples), First Describe (forcing detailed image descriptions), Wrong Image (with intentionally mismatched images), and Without Image (text-only). The findings revealed several important trends:

  • Knowledge vs. Reasoning: Models consistently performed better on knowledge-based tasks compared to reasoning-based tasks. This gap was even more pronounced in Persian, suggesting that complex reasoning in this language poses a greater challenge for current VLMs.
  • Hallucination Detection: Gemini 2.0 Flash demonstrated superior ability in detecting image mismatches, indicating better robustness against generating hallucinatory content, especially in Persian contexts.
  • Image Presence Detection: GPT-4-Turbo and GPT-4o were more accurate in recognizing when an image was present, exhibiting lower rates of mistakenly reporting “no image.” Conversely, Gemini 2.0 Flash showed a higher incidence of these errors, particularly for Persian inputs.
  • Difficulty Levels: As expected, models struggled more with higher-level questions in subjects like Chemistry and Mathematics, with performance generally declining as complexity increased.
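The five evaluation scenarios and the per-group comparisons above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than the authors' code: the dictionary keys (`question`, `image`, `correct`), the prompt wording, and the function names are made up, and the actual benchmark prompts may differ.

```python
import random
from collections import defaultdict

SCENARIOS = ("zero_shot", "in_context", "first_describe", "wrong_image", "without_image")


def build_request(item, scenario, pool, examples=(), rng=None):
    """Build the prompt and image for one item under one scenario.

    item and pool entries are dicts with hypothetical keys "question"
    and "image"; examples is a sequence of (question, answer) pairs.
    """
    rng = rng or random.Random(0)
    prompt, image = item["question"], item["image"]
    if scenario == "zero_shot":
        pass  # question and its own image, no extra guidance
    elif scenario == "in_context":
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
        prompt = f"{shots}\n\nQ: {prompt}\nA:"
    elif scenario == "first_describe":
        prompt = "First describe the image in detail, then answer.\n" + prompt
    elif scenario == "wrong_image":
        # Swap in an image from a different item (the pool must contain one).
        others = [p["image"] for p in pool if p["image"] != item["image"]]
        image = rng.choice(others)
    elif scenario == "without_image":
        image = None
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return {"prompt": prompt, "image": image}


def accuracy_by(results, *keys):
    """Group result dicts by the given keys and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for row in results:
        group = tuple(row[key] for key in keys)
        totals[group][0] += int(row["correct"])
        totals[group][1] += 1
    return {group: right / count for group, (right, count) in totals.items()}
```

Grouping with something like `accuracy_by(results, "language", "task")` is one way to surface the knowledge-versus-reasoning gap per language, provided each result row carries those (assumed) labels.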

The Impact of MEENA

The introduction of MEENA is a significant contribution to the field of multimodal AI. It provides a much-needed benchmark for evaluating VLMs in Persian, a relatively low-resource language, and highlights the challenges and opportunities in developing truly multilingual and culturally aware AI systems. By offering a comprehensive and authentic dataset, MEENA paves the way for enhancing VLM capabilities beyond English, fostering more inclusive and globally relevant AI technologies. Researchers and developers can access the dataset on Hugging Face and the code on GitHub, and a leaderboard is available to track model performance. For more details, refer to the full research paper.

Rhea Bhattacharya
https://blogs.edgentiq.com
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
