TLDR: MizanQA is a new benchmark dataset with over 1,700 multiple-choice questions designed to evaluate Large Language Models (LLMs) on Moroccan legal question answering. It addresses the unique linguistic and cultural complexities of Moroccan law, which blends Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Initial benchmarking reveals significant performance gaps in current LLMs, highlighting the need for domain-specific development and tailored evaluation metrics, especially for multi-answer questions and culturally specific legal terminology.
Large Language Models (LLMs) have become far better at understanding and generating human language, but their performance often falters in highly specialized domains, especially low-resource ones such as Arabic legal contexts. A new research paper introduces MizanQA, a benchmark designed to test LLMs on Moroccan legal question answering.
What is MizanQA?
MizanQA, named after the Arabic word for “scale” (a universal symbol of justice), is a dataset of over 1,700 multiple-choice questions. The questions are designed to capture the linguistic and legal complexities specific to Moroccan law. Unlike many existing benchmarks, MizanQA includes questions that require selecting multiple correct options, which adds difficulty and mirrors real-world legal assessments.
The Moroccan legal system is particularly challenging for AI. Its language is Modern Standard Arabic, but it is deeply interwoven with local legal idioms and cultural references. It draws on several sources, including Islamic Maliki jurisprudence, Moroccan customary law, and elements of French and international law. This blend introduces cultural nuances and archaic or region-specific expressions that standard Arabic language models rarely handle well, which makes accurate legal question answering a formidable task for LLMs.
How was MizanQA built?
The creation of MizanQA involved a multi-phase, semi-automated process. It began with collecting publicly available Moroccan law MCQ banks and exams. A crucial step involved temporal curation by a legal expert to ensure all documents were based on current legislation. The questions were then organized, often manually, into image batches to facilitate automated extraction using a multimodal LLM (Gemini-2.0-flash). Finally, every extracted question and its options were manually verified by annotators, and categorized by legal topic, such as Criminal Law or the Moroccan Constitution.
Benchmarking LLMs on Moroccan Law
The researchers evaluated several leading multilingual and Arabic-focused LLMs on the MizanQA benchmark, including Allam-2, Gemini-1.5-flash, Gemini-2.0-flash, Llama-3.3, Llama-4-maverick, and Llama-4-scout. Recognizing the unique multi-answer format of Moroccan legal questions, the study also proposed new evaluation metrics beyond traditional accuracy, such as F1-like scores and Partial Match Penalized Accuracy (PMPA), to better assess partial correctness and penalize incorrect selections. Confidence calibration measures (ECEopt and ECEset) were also used to evaluate how well models’ predicted probabilities align with actual correctness.
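To make the idea of partial credit on multi-answer questions concrete, here is a minimal sketch of set-based scoring in Python. It is a generic illustration, not the paper's implementation: the function name `set_scores` and the returned keys are hypothetical, and the exact PMPA formula is defined in the paper and not reproduced here.

```python
def set_scores(predicted, gold):
    """Score one multi-answer question by comparing option sets.

    `predicted` and `gold` are iterables of option labels, e.g. ["A", "C"].
    Returns exact-match accuracy plus set-based precision, recall and F1.
    (Hypothetical helper for illustration; not the paper's code.)
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Precision drops when the model selects wrong options.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall drops when the model misses correct options.
    recall = true_positives / len(gold_set) if gold_set else 0.0

    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {
        "exact": 1.0 if pred_set == gold_set else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # The model picked A and C, but the key is A and B:
    # exact match gives no credit, while F1 gives partial credit.
    print(set_scores(["A", "C"], ["A", "B"]))
```

Under strict accuracy the example answer counts as simply wrong, whereas the set-based scores record that one of the two selected options was correct. This is the gap that partial-match metrics are meant to capture.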
The benchmarking results revealed significant performance gaps. While Gemini models generally outperformed others across most metrics, and Llama-4-maverick showed superior calibration, all models demonstrated limitations in handling culturally specific terminology and complex reasoning. Performance varied significantly across different legal categories; for instance, LLMs performed better on the Law of Obligations and Contracts and the Moroccan Constitution, possibly due to their alignment with international legal standards. Conversely, areas like the Family Code and Criminal Law, which integrate Islamic jurisprudence and human rights frameworks, proved more challenging.
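Calibration is commonly summarized with expected calibration error (ECE), which compares a model's stated confidence with how often it is actually right. The sketch below shows the generic equal-width-bin form of ECE. The ECEopt and ECEset variants used in the study are defined in the paper and may differ, and the function name and the default of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE with equal-width confidence bins.

    `confidences` are probabilities in [0, 1]; `correct` are booleans saying
    whether each prediction was right. (Illustrative sketch, not the paper's
    ECEopt/ECEset definitions.)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / total) * abs(accuracy - avg_confidence)
    return ece


if __name__ == "__main__":
    # Four answers given with 90% confidence, three of them correct:
    # the model is overconfident by about 0.15.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

A lower value means stated confidence tracks actual correctness more closely. A model can therefore be well calibrated without being the most accurate.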
Looking Ahead
MizanQA represents a vital first step in creating benchmarks for legal reasoning in low-resource contexts. The findings underscore the critical need for domain-specific benchmarks that reflect the linguistic and cultural diversity of legal systems. This research aims to promote the equitable development and assessment of legal AI systems, ensuring they can provide reliable support in diverse legal environments. The dataset is publicly available for further research and development. You can find more details about this research in the full paper: MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering.
The researchers acknowledge limitations, including coverage bias (not comprehensively representing all Moroccan law or other Arab countries’ laws), potential oversimplification of real-world legal complexity, and the inherent constraints of the multiple-choice format. However, MizanQA stands as a significant contribution towards making legal AI more accessible and effective in specialized domains.


