TLDR: PERCOR is the first large-scale Persian commonsense reasoning benchmark, featuring 106,000 multiple-choice sentence-completion problems. It uses a novel conjunction-based segmentation for diverse content and a generation-free adversarial filtering method called DRESS-AF to create challenging distractors. While top proprietary models achieve over 90% accuracy, the best open-source model trails them by about 10 percentage points, highlighting a significant performance gap in Persian commonsense reasoning. The dataset is designed to be difficult for AI but solvable by humans, and DRESS-AF also proved effective in increasing the difficulty of English benchmarks like HellaSwag.
A new benchmark, PERCOR (Persian Commonsense Reasoning), has been introduced to address a gap in evaluating and improving commonsense reasoning in the Persian language. It is the first large-scale Persian dataset of its kind and offers a platform for advancing natural language understanding in a low-resource language.
Developed by Morteza Alikhani, Mohammadtaha Bagherifard, Erfan Zinvandi, and Mehran Sarmadi from MCINEXT, PERCOR consists of an impressive 106,000 multiple-choice sentence-completion problems. These problems are meticulously drawn from over forty diverse Persian web sources, including news, cultural sites, and other online content, ensuring a broad range of topics and linguistic styles.
A Novel Approach to Dataset Creation
The creation of PERCOR involved a sophisticated three-stage pipeline. First, a vast collection of raw text segments was gathered from the Corpesia corpus, a large-scale resource of Persian websites. This data was then cleaned to remove irrelevant sections while preserving the original paragraph structure.
The second stage introduced a novel conjunction-based segmentation strategy for generating coherent sentence–completion pairs. Unlike methods that rely on temporally grounded data like video captions, this approach splits sentences at high-frequency Persian conjunctions, promoting natural flow and semantic coherence across a wide array of textual sources. To ensure the quality of these pairs, a lightweight filtering step using the GPT-4o-mini model was employed to verify that conjunctions functioned as true discourse connectives and that completion segments were syntactically and semantically complete.
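The splitting step described above can be illustrated with a minimal sketch. The conjunction list, the minimum-length rule, and the choice to keep the conjunction at the end of the context are all assumptions made for illustration; the article does not list the conjunctions or thresholds actually used, and the GPT-4o-mini quality filter is not reproduced here.

```python
# Hypothetical conjunction list (because, but, but, therefore);
# the article does not enumerate the conjunctions PERCOR uses.
CONJUNCTIONS = ["زیرا", "اما", "ولی", "بنابراین"]

# Punctuation that may be attached to a token (Persian and Latin).
PUNCTUATION = "،؛,.;"


def split_at_conjunction(sentence, conjunctions=CONJUNCTIONS, min_words=3):
    """Split a sentence into (context, completion) at the first usable conjunction.

    Assumption: the conjunction stays at the end of the context, and both
    sides must contain at least `min_words` words. Returns None if no
    conjunction satisfies that rule.
    """
    words = sentence.split()
    for i, word in enumerate(words):
        if word.strip(PUNCTUATION) not in conjunctions:
            continue
        words_before = i
        words_after = len(words) - i - 1
        if words_before >= min_words and words_after >= min_words:
            context = " ".join(words[: i + 1])
            completion = " ".join(words[i + 1:])
            return context, completion
    return None


if __name__ == "__main__":
    pair = split_at_conjunction(
        "the road was closed because snow had fallen overnight",
        conjunctions=["because"],
    )
    print(pair)
```

The function takes the conjunction list as a parameter, so the same sketch works for any language; a real pipeline would still need a check that the matched word is acting as a discourse connective, which is the role the article assigns to the model-based filter.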
The third and perhaps most innovative stage involved distractor generation, utilizing a method called DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering). This technique is crucial for creating challenging multiple-choice questions. Instead of generating distractors using large language models, which can introduce biases, DRESS-AF selects them from a pool of existing gold completions. It ranks these candidates based on embedding similarity scores and then adversarially optimizes parameters to maximize model confusion, ensuring the distractors are difficult for AI models but still solvable by humans. This language-agnostic method was also successfully applied to the English HellaSwag benchmark, demonstrating its versatility in increasing dataset difficulty without compromising human solvability.
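As a rough illustration of the ranking idea only, the sketch below scores candidate distractors by cosine similarity to the gold completion and keeps the most similar ones. It is not the authors' implementation: the function names, the `max_sim` cutoff (meant to drop near-duplicates of the gold answer so the item stays solvable), and the toy two-dimensional vectors are assumptions, and the adversarial parameter search is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_distractors(gold_vec, pool, k=3, max_sim=0.95):
    """Pick the k candidates most similar to the gold completion.

    pool: list of (text, vector) pairs, e.g. gold completions of other items.
    max_sim: hypothetical cutoff; candidates at or above it are skipped
             because they may be paraphrases of the correct answer.
    """
    scored = [(cosine(gold_vec, vec), text) for text, vec in pool]
    eligible = [(score, text) for score, text in scored if score < max_sim]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in eligible[:k]]


if __name__ == "__main__":
    # Toy embeddings; a real pipeline would use a sentence-embedding model.
    pool = [
        ("near duplicate", [1.0, 0.0]),
        ("close", [0.8, 0.6]),
        ("unrelated", [0.0, 1.0]),
        ("somewhat close", [0.6, 0.8]),
    ]
    print(pick_distractors([1.0, 0.0], pool, k=2))
```

In a fuller version, values such as `k` and `max_sim` would be the kind of parameters tuned against a model's accuracy, so that the chosen distractors confuse the model as much as possible while remaining distinguishable for people.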
Benchmarking AI Performance
The researchers benchmarked 32 state-of-the-art large language models (LLMs), both open-source and closed-source, on the PERCOR dataset in a zero-shot setting. The results highlight a significant performance gap. Human annotators achieved an accuracy of 89% on PERCOR, indicating the questions are plausible and non-trivial. Among the AI models, OpenAI-o3 achieved the highest performance at 92.18%, closely followed by Claude-Sonnet-3.7 at 91.17%. These proprietary models currently surpass non-expert human annotators.
However, the strongest open-source model, DeepSeek-R1, reached 82.51%, leaving a gap of roughly 10 percentage points between the best open and closed systems (92.18% versus 82.51%). Many other open-source models scored in the 60% to 80% range. The study also showed the importance of prompt-following and output format adherence: some models embedded the correct answer within extra prose, and post-processing their outputs to extract it significantly improved their measured accuracy.
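The effect of post-processing can be pictured with a small answer-extraction sketch. The option labels (A to D), the regular expressions, and the function names are assumptions for illustration; the article does not describe the exact rules the researchers applied.

```python
import re


def extract_choice(output, labels="ABCD"):
    """Pull a single option label out of a free-form model response.

    First looks for an explicit "answer is X" / "Answer: (X)" pattern, then
    falls back to a standalone label if exactly one distinct label appears.
    Returns None when the response is ambiguous or contains no label.
    """
    label_class = "[" + re.escape(labels) + "]"
    explicit = re.search(
        r"(?i:answer)\s*(?:(?i:is)\s*)?[:\-]?\s*\(?(" + label_class + r")\)?(?![A-Za-z])",
        output,
    )
    if explicit:
        return explicit.group(1)
    standalone = set(
        re.findall(r"(?<![A-Za-z])(" + label_class + r")(?![A-Za-z])", output)
    )
    if len(standalone) == 1:
        return standalone.pop()
    return None


def accuracy(outputs, gold_labels):
    """Fraction of responses whose extracted label matches the gold label."""
    hits = sum(extract_choice(o) == g for o, g in zip(outputs, gold_labels))
    return hits / len(gold_labels)


if __name__ == "__main__":
    responses = ["The answer is B.", "Answer: (C) because it fits", "Either A or D"]
    print(accuracy(responses, ["B", "C", "A"]))
```

Strict exact-match scoring would count the first two responses above as wrong even though they contain the right label, which is the kind of mismatch the article says post-processing corrected.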
Interestingly, even lightweight fine-tuning with a small subset of the training data (10%) on instruction-tuned open models like LLaMA3.3-70B-Instruct and Qwen3-32B-Instruct led to substantial improvements, surpassing the strongest zero-shot open-source baselines. This suggests significant latent capabilities in these models that can be unlocked with minimal task-specific supervision.
Looking Ahead
PERCOR represents a vital step forward for Persian natural language processing, providing a challenging benchmark for commonsense reasoning. The dataset and the DRESS-AF method are expected to catalyze further research in multilingual commonsense reasoning and foster the development of more robust and culturally aware language models. The dataset is publicly available to researchers on Hugging Face.


