Evaluating Multimodal AI in Persian: Introducing the MEENA Dataset

TLDR: MEENA is the first comprehensive dataset for evaluating Vision-Language Models (VLMs) in Persian, with approximately 7,500 Persian and 3,000 English questions spanning multiple subjects and educational levels. It addresses the English-centric bias of existing VLM benchmarks, ships with rich metadata, and offers insight into how VLMs handle knowledge-based versus reasoning tasks, hallucination detection, and language-specific challenges.

Recent advancements in artificial intelligence, particularly in Vision-Language Models (VLMs), have opened up new possibilities for how machines understand and interact with the world. These models are designed to interpret both visual information (like images and diagrams) and textual data, enabling them to perform complex tasks such as answering questions about images or generating captions. However, a significant challenge has been the predominant focus on the English language, leaving a considerable gap for other languages and cultures.

To address this imbalance, a new research paper introduces MEENA, also known as PersianMMMU. The dataset is the first of its kind designed specifically to evaluate Vision-Language Models in Persian across a wide range of tasks, including scientific reasoning, general knowledge, and human-level understanding. Its creation is a step towards making AI more inclusive and capable across diverse linguistic contexts.

What is MEENA?

MEENA stands for Multimodal-Multilingual Educational Exams for N-level Assessment. The name “Mina” itself holds cultural significance in Persian, referring to glass and a traditional art form called Mina-kari, which subtly connects to the dataset’s multimodal nature, including questions related to art. The dataset is comprehensive, comprising approximately 7,500 questions in Persian and 3,000 in English. These questions span a broad spectrum of topics, from core academic subjects like mathematics, physics, and reasoning to more culturally specific areas such as Persian art and literature. It also includes questions involving diagrams and charts, ensuring a holistic evaluation of VLM capabilities.

The dataset has several key features:

Diverse Subject Coverage: Questions are drawn from various educational levels, ranging from primary school to upper secondary school, ensuring a wide assessment of knowledge and reasoning skills.
Rich Metadata: Each question comes with detailed metadata, including difficulty levels, descriptive answers, and indicators for “trap” questions designed to test deeper reasoning. This allows for granular analysis of model performance.
Original Persian Data: Unlike many translated datasets, MEENA features original Persian content, preserving crucial cultural nuances that are often lost in direct translations.
Bilingual Structure: The inclusion of both Persian and English questions allows for direct comparison and assessment of cross-linguistic performance of VLMs.
Extensive Experiments: The benchmark facilitates diverse experiments to evaluate overall performance, a model’s ability to attend to images, and its tendency to generate “hallucinations” (incorrect or irrelevant information).
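
To make the metadata description concrete, the sketch below models a single question record and a filter over its metadata. Every field name, value, and the `select` helper is an assumption made for illustration; the article does not give the actual schema of the released dataset, so this shows only the kind of granular slicing that such metadata allows.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    # Hypothetical fields mirroring the metadata described in the article;
    # the real dataset schema may use different names and value sets.
    text: str
    options: List[str]
    answer_index: int                  # position of the correct option
    language: str                      # "fa" for Persian, "en" for English
    subject: str
    grade_level: str                   # e.g. "primary", "upper_secondary"
    difficulty: str                    # e.g. "easy", "medium", "hard"
    is_trap: bool = False              # flagged "trap" question
    descriptive_answer: Optional[str] = None
    image_path: Optional[str] = None


def select(questions, *, language=None, difficulty=None, trap_only=False):
    """Return the questions that match every metadata filter supplied."""
    result = []
    for q in questions:
        if language is not None and q.language != language:
            continue
        if difficulty is not None and q.difficulty != difficulty:
            continue
        if trap_only and not q.is_trap:
            continue
        result.append(q)
    return result


# Placeholder items, not taken from the dataset.
sample = [
    Question("2 + 2 = ?", ["3", "4"], 1, "en", "mathematics", "primary", "easy"),
    Question("placeholder trap item", ["a", "b"], 0, "fa", "physics",
             "upper_secondary", "hard", is_trap=True),
]
hard_traps = select(sample, language="fa", difficulty="hard", trap_only=True)
```

Filtering on fields like these is what would let an evaluation report separate scores for, say, hard Persian trap questions rather than a single aggregate number.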

How Was MEENA Developed?

The questions in MEENA are primarily sourced from Iran’s 12-year educational framework, specifically from the “Pellekan Yadgiri” (Learning Ladder) platform and a curated selection of questions from Iranian national university entrance exams. The compilation process involved data extraction and cleaning, image processing to ensure compatibility with VLMs (e.g., merging multiple images into a single one), and content filtering to maintain relevance and quality. The dataset also includes a bilingual subset, in which Persian questions with images were translated into English. Semantic accuracy was a priority here: the translation used advanced language models such as GPT-4o, with an “LLM-as-a-Judge” approach for quality control.
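
The image-merging step can be pictured with a small sketch. The article does not say how the authors combined images, so the side-by-side layout, the white padding, and the function name `merge_side_by_side` are all assumptions. Images are represented here as plain lists of grayscale pixel rows to keep the example free of dependencies; a real pipeline would most likely use an imaging library instead.

```python
def merge_side_by_side(images, fill=255):
    """Merge grayscale images (lists of pixel rows) into one wide image.

    Each image is assumed to have rows of equal length. Shorter images are
    padded at the bottom with `fill` so the result is rectangular.
    """
    if not images:
        raise ValueError("need at least one image")
    height = max(len(img) for img in images)
    merged = [[] for _ in range(height)]
    for img in images:
        width = len(img[0]) if img else 0
        for y in range(height):
            # Use the real row if the image has one, else a padding row.
            row = img[y] if y < len(img) else [fill] * width
            merged[y].extend(row)
    return merged


# Two toy "images": a 2x2 block and a 1x1 block.
left = [[1, 2], [3, 4]]
right = [[5]]
combined = merge_side_by_side([left, right])
```

The point of a step like this is that a question with several figures can be passed to a model that accepts only one image per prompt.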

Key Findings from the Research

The researchers ran a series of experiments with prominent VLMs: GPT-4o, GPT-4o-mini, GPT-4-Turbo, Gemini 2.0 Flash, and InstructBLIP-T5. The experiments covered five settings: Zero-Shot (minimal guidance), In-Context Learning (with examples), First Describe (the model must describe the image in detail before answering), Wrong Image (an intentionally mismatched image), and Without Image (text only). The findings revealed several trends:
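
As a rough illustration of how the five settings differ, the sketch below builds the prompt and image that a model would receive in each one. The prompt wording, the setting identifiers, and the `build_inputs` function are invented for this example and are not the prompts used in the paper.

```python
import random


def build_inputs(question, image, other_images, setting, examples=(), seed=0):
    """Return (prompt, image) for one evaluation setting.

    All prompt text is hypothetical and serves only to show how the
    settings vary what the model sees.
    """
    base = f"Question: {question}\nAnswer with the option number."
    if setting == "zero_shot":
        return base, image
    if setting == "in_context":
        # Prepend worked examples, if any were supplied.
        shots = "\n\n".join(examples)
        return (shots + "\n\n" + base if shots else base), image
    if setting == "first_describe":
        return "First describe the image in detail.\n" + base, image
    if setting == "wrong_image":
        # Pair the question with an image taken from a different item.
        rng = random.Random(seed)
        return base, rng.choice(other_images)
    if setting == "without_image":
        return base, None
    raise ValueError(f"unknown setting: {setting}")


prompt, img = build_inputs("Which shape is shown?", "q1.png",
                           ["q2.png", "q3.png"], "wrong_image")
```

Comparing a model's answers under `wrong_image` and `without_image` against the normal settings is what makes it possible to tell whether the model is actually attending to the picture.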

Knowledge vs. Reasoning: Models consistently performed better on knowledge-based tasks compared to reasoning-based tasks. This gap was even more pronounced in Persian, suggesting that complex reasoning in this language poses a greater challenge for current VLMs.
Hallucination Detection: Gemini 2.0 Flash demonstrated superior ability in detecting image mismatches, indicating better robustness against generating hallucinatory content, especially in Persian contexts.
Image Presence Detection: GPT-4-Turbo and GPT-4o were more accurate in recognizing when an image was present, exhibiting lower rates of mistakenly reporting “no image.” Conversely, Gemini 2.0 Flash showed a higher incidence of these errors, particularly for Persian inputs.
Difficulty Levels: As expected, models struggled more with higher-level questions in subjects like Chemistry and Mathematics, with performance generally declining as complexity increased.
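
Comparisons such as the knowledge-versus-reasoning gap come down to grouped accuracy. The sketch below shows one way to compute it. The record keys (`language`, `task`, `correct`) and the toy records are assumptions invented to exercise the function; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, *keys):
    """Group records by the given keys and return accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        correct[group] += int(r["correct"])
    return {g: correct[g] / totals[g] for g in totals}


# Toy records, invented for illustration only; NOT results from the paper.
records = [
    {"language": "fa", "task": "knowledge", "correct": True},
    {"language": "fa", "task": "knowledge", "correct": True},
    {"language": "fa", "task": "reasoning", "correct": True},
    {"language": "fa", "task": "reasoning", "correct": False},
]
scores = accuracy_by(records, "language", "task")
gap = scores[("fa", "knowledge")] - scores[("fa", "reasoning")]
```

Passing different keys (for example a difficulty or subject field, if the records carry one) would yield the per-level breakdowns described above.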

The Impact of MEENA

The introduction of MEENA is a significant contribution to the field of multimodal AI. It provides a much-needed benchmark for evaluating VLMs in Persian, a relatively low-resource language, and highlights both the challenges and the opportunities in developing multilingual, culturally aware AI systems. By offering a comprehensive and authentic dataset, MEENA supports efforts to extend VLM capabilities beyond English. The dataset is available on Hugging Face and the code on GitHub, with a leaderboard for tracking model performance. For more details, refer to the full research paper.
