Advancing Persian AI: The PERCOR Commonsense Reasoning Dataset

TLDR: PERCOR is the first large-scale Persian commonsense reasoning benchmark, featuring 106,000 multiple-choice sentence-completion problems. It uses a novel conjunction-based segmentation strategy to draw sentence-completion pairs from diverse content, and a generation-free adversarial filtering method called DRESS-AF to create challenging distractors. While top proprietary models achieve over 90% accuracy, the best open-source model trails them by about 10 percentage points, highlighting a significant performance gap in Persian commonsense reasoning. The dataset is designed to be difficult for AI but solvable by humans, and DRESS-AF also proved effective in increasing the difficulty of English benchmarks such as HellaSwag.

PERCOR (Persian Commonsense Reasoning) is a new benchmark for evaluating and improving commonsense reasoning in the Persian language. It is the first large-scale Persian dataset of its kind and offers a robust platform for advancing natural language understanding in a low-resource language.

Developed by Morteza Alikhani, Mohammadtaha Bagherifard, Erfan Zinvandi, and Mehran Sarmadi from MCINEXT, PERCOR consists of 106,000 multiple-choice sentence-completion problems. The problems are drawn from over forty Persian web sources, including news, cultural sites, and other online content, which gives the dataset a broad range of topics and linguistic styles.

A Novel Approach to Dataset Creation

The creation of PERCOR involved a sophisticated three-stage pipeline. First, a vast collection of raw text segments was gathered from the Corpesia corpus, a large-scale resource of Persian websites. This data was then cleaned to remove irrelevant sections while preserving the original paragraph structure.

The second stage introduced a novel conjunction-based segmentation strategy for generating coherent sentence–completion pairs. Unlike methods that rely on temporally grounded data like video captions, this approach splits sentences at high-frequency Persian conjunctions, promoting natural flow and semantic coherence across a wide array of textual sources. To ensure the quality of these pairs, a lightweight filtering step using the GPT-4o-mini model was employed to verify that conjunctions functioned as true discourse connectives and that completion segments were syntactically and semantically complete.
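The splitting idea can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the conjunction list, the whitespace tokenization, the `min_tokens` guard, and the choice to keep the conjunction at the end of the context are all hypothetical, and the GPT-4o-mini quality filter is not reproduced.

```python
from typing import Optional, Sequence, Tuple

# Illustrative Persian conjunctions ("because", "but", "but", "since", "therefore").
# This list is an assumption; the article does not give the actual list.
ILLUSTRATIVE_CONJUNCTIONS = ("زیرا", "اما", "ولی", "چون", "بنابراین")


def split_at_conjunction(
    sentence: str,
    conjunctions: Sequence[str] = ILLUSTRATIVE_CONJUNCTIONS,
    min_tokens: int = 2,
) -> Optional[Tuple[str, str]]:
    """Split a sentence into (context, completion) at the first usable conjunction.

    The conjunction stays at the end of the context (an assumption), so the
    completion is the clause a model would have to predict. Returns None when
    no conjunction leaves at least `min_tokens` tokens on both sides.
    """
    tokens = sentence.split()  # naive whitespace tokenization
    for i, token in enumerate(tokens):
        tokens_after = len(tokens) - i - 1
        if token in conjunctions and i >= min_tokens and tokens_after >= min_tokens:
            context = " ".join(tokens[: i + 1])
            completion = " ".join(tokens[i + 1 :])
            return context, completion
    return None
```

For example, calling `split_at_conjunction("he stayed home because it was raining", conjunctions=("because",))` yields the context "he stayed home because" and the completion "it was raining". A sentence that merely starts with a conjunction is rejected, since there would be no context to condition on.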

The third stage builds the distractors using a method called DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), which is central to making the multiple-choice questions challenging. Instead of generating distractors with large language models, which can introduce biases, DRESS-AF selects them from a pool of existing gold completions. It ranks these candidates by embedding similarity scores and then adversarially optimizes its parameters to maximize model confusion, so that the distractors are difficult for AI models but still solvable by humans. This language-agnostic method was also applied successfully to the English HellaSwag benchmark, where it increased dataset difficulty without compromising human solvability.
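The ranking half of this idea can be sketched in a few lines of Python. This is a rough illustration, not the DRESS-AF algorithm itself: the function names, the plain-list embeddings, and the `max_sim` cutoff (meant to skip near-paraphrases of the gold answer) are assumptions, and the adversarial parameter optimization described above is not reproduced.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_distractors(
    gold_vec: Sequence[float],
    pool: List[Tuple[str, Sequence[float]]],
    k: int = 3,
    max_sim: float = 0.95,  # hypothetical cutoff, not a value from the paper
) -> List[str]:
    """Pick the k pool completions most similar to the gold completion.

    `pool` holds (text, embedding) pairs taken from other items' gold
    completions. Candidates at or above `max_sim` are skipped, on the
    assumption that they may be valid answers rather than distractors.
    """
    scored = [(cosine(gold_vec, vec), text) for text, vec in pool]
    eligible = [pair for pair in scored if pair[0] < max_sim]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in eligible[:k]]
```

In a real pipeline the embeddings would come from a sentence encoder, and parameters such as `k` and `max_sim` are the kind of knobs an adversarial search could tune against a set of models.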

Benchmarking AI Performance

The researchers benchmarked 32 state-of-the-art large language models (LLMs), both open-source and closed-source, on the PERCOR dataset in a zero-shot setting. The results highlight a significant performance gap. Human annotators achieved an accuracy of 89% on PERCOR, indicating the questions are plausible and non-trivial. Among the AI models, OpenAI-o3 achieved the highest performance at 92.18%, closely followed by Claude-Sonnet-3.7 at 91.17%. These proprietary models currently surpass non-expert human annotators.

However, the strongest open-source model, DeepSeek-R1, reached 82.51%, leaving a gap of roughly 10 percentage points (92.18% versus 82.51%) between the best closed and open systems. Many other open-source models scored in the 60% to 80% range. The study also showed that prompt-following and output-format adherence matter: post-processing model outputs significantly improved accuracy for some models that embedded correct answers within extra prose.
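The kind of post-processing involved can be illustrated with a small Python sketch. It assumes, hypothetically, that options are labeled 1 to 4 and that models answer in English-like phrasing; the article does not describe the actual option labels or extraction rules, so the patterns below are illustrative only.

```python
import re
from typing import Optional

# Patterns tried in order; all are illustrative assumptions.
_ANSWER_PATTERNS = [
    # "the answer is 3", "Answer: (2)"
    re.compile(r"answer\s*(?:is|:)?\s*\(?([1-4])\)?", re.IGNORECASE),
    # "option 4", "choice (1)"
    re.compile(r"(?:option|choice)\s*(?:is|:)?\s*\(?([1-4])\)?", re.IGNORECASE),
    # the whole output is just a label, e.g. "2" or "(2)."
    re.compile(r"^\s*\(?([1-4])\)?\s*[.)]?\s*$"),
]


def extract_choice(output: str) -> Optional[int]:
    """Recover a 1-4 option label from free-form model output, else None."""
    for pattern in _ANSWER_PATTERNS:
        match = pattern.search(output)
        if match:
            return int(match.group(1))
    # Fallback: accept a lone label only if it is the single one mentioned.
    labels = set(re.findall(r"\b([1-4])\b", output))
    if len(labels) == 1:
        return int(labels.pop())
    return None
```

A response such as "The correct answer is 3 because the other options are unrelated" would be scored as option 3 rather than marked wrong for ignoring the requested format, while an ambiguous response such as "Either 1 or 2" is left unresolved.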

Interestingly, even lightweight fine-tuning with a small subset of the training data (10%) on instruction-tuned open models like LLaMA3.3-70B-Instruct and Qwen3-32B-Instruct led to substantial improvements, surpassing the strongest zero-shot open-source baselines. This suggests significant latent capabilities in these models that can be unlocked with minimal task-specific supervision.

Looking Ahead

PERCOR is an important step forward for Persian natural language processing, providing a challenging benchmark for commonsense reasoning. The dataset and the DRESS-AF method are expected to catalyze further research in multilingual commonsense reasoning and foster the development of more robust and culturally aware language models. The dataset is publicly available to researchers on Hugging Face.
