TLDR: A research paper assessed seven Large Language Models (LLMs) on their ability to perform Islamic legal reasoning, specifically in inheritance law, using 1,000 multiple-choice questions. The study found a significant performance gap: commercial models like o3 and Gemini 2.5 achieved over 90% accuracy, demonstrating strong reasoning capabilities. In contrast, open-source models such as ALLaM, Fanar, LLaMA, and Mistral scored below 50%, often making foundational errors, struggling with complex calculations, and even generating fabricated justifications. The findings highlight the limitations of current LLMs in structured legal reasoning and emphasize the need for agentic AI systems supported by expert-guided datasets for specialized domains.
The research paper, “Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation,” explores how Large Language Models (LLMs) perform when faced with the intricate rules of Islamic inheritance law, known as ‘ilm al-mawārīth. This study, conducted by Abdessalam BOUCHEKIF, Samer RASHWANI, Heba Sbahi, Shahd Gaben, Mutez AL-KHATIB, and Mohammed GHALY from Hamad Bin Khalifa University, Qatar, uncovers significant differences in the ability of various LLMs to handle the structured legal reasoning and precise calculations required in this specialized field.
Islamic inheritance law is a deeply rooted and highly structured legal system, drawing its principles from the Quran, the Prophetic tradition (Sunnah), and broader Islamic jurisprudence. It dictates the distribution of a deceased person’s estate through a fixed framework that combines normative principles with exact arithmetic. Successfully navigating these problems demands a combination of cognitive understanding, legal knowledge, and computational accuracy. This includes tasks such as identifying family relationships, applying specific exclusion rules, determining who is an eligible heir, and performing complex adjustments like redistribution (radd) and proportional reduction (‘awl).
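The two adjustments named above can be illustrated with exact fractions. The following is a minimal sketch, not the paper's method: the function name `adjust_shares` and both example cases are illustrative assumptions. It makes the simplifying assumption that every listed heir takes part in the adjustment and that there is no residuary heir. Which heirs actually participate in radd differs between juristic positions, and this sketch ignores that.

```python
from fractions import Fraction

def adjust_shares(fixed_shares):
    """Scale fixed shares so that they sum to exactly one estate.

    Shares summing to more than 1 are reduced proportionally ('awl);
    shares summing to less than 1 are increased proportionally (radd).
    Simplifying assumption: every heir passed in takes part in the
    adjustment and no residuary heir exists.
    """
    total = sum(fixed_shares.values())
    return {heir: share / total for heir, share in fixed_shares.items()}

# Hypothetical 'awl case: 1/2 + 2/3 + 1/6 = 8/6, so the base 6 is raised to 8.
awl_case = adjust_shares({
    "husband": Fraction(1, 2),
    "two full sisters": Fraction(2, 3),
    "mother": Fraction(1, 6),
})

# Hypothetical radd case: 1/2 + 1/6 = 4/6, so the leftover 2/6 is redistributed.
radd_case = adjust_shares({
    "daughter": Fraction(1, 2),
    "mother": Fraction(1, 6),
})

for heir, share in {**awl_case, **radd_case}.items():
    print(heir, share)
```

In the first case the husband, sisters, and mother end up with 3/8, 4/8, and 1/8 of the estate. In the second, the daughter and mother end up with 3/4 and 1/4. Using `Fraction` rather than floats keeps the arithmetic exact, which matters because a share that is off by a rounding error is simply wrong.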
The researchers evaluated seven different LLMs in a zero-shot setting, meaning the models received no specific fine-tuning for this task. They used a comprehensive benchmark of 1,000 multiple-choice questions, which covered a diverse array of inheritance scenarios. These questions were designed to test each model’s capacity, from understanding the basic context of an inheritance case to accurately computing the shares as prescribed by Islamic jurisprudence. The dataset was carefully structured into two difficulty levels: 500 “Beginner” questions and 500 “Advanced” questions. This allowed the researchers to differentiate between models that lacked fundamental knowledge and those that struggled with more complex cases requiring deeper legal reasoning and mathematical computation.
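Scoring a multiple-choice benchmark split by difficulty could look roughly like the sketch below. The record layout, the function name `accuracy_by_level`, and the sample data are assumptions made for illustration. They are not taken from the paper's evaluation code or dataset.

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Compute accuracy per difficulty level.

    records: iterable of (level, gold_choice, predicted_choice) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, gold, predicted in records:
        total[level] += 1
        if predicted == gold:
            correct[level] += 1
    return {level: correct[level] / total[level] for level in total}

# Made-up sample records, not drawn from the actual benchmark.
sample = [
    ("Beginner", "A", "A"),
    ("Beginner", "C", "C"),
    ("Advanced", "B", "D"),
    ("Advanced", "E", "E"),
]
scores = accuracy_by_level(sample)
print(scores)
```

Reporting the two levels separately is what lets an evaluation distinguish a model that lacks basic knowledge from one that fails only on the harder cases.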
The results show a stark performance divide among the evaluated models. The commercial models o3 and Gemini 2.5 both exceeded 90% accuracy. GPT-4.5 reached 74.0%, placing it between the top-tier reasoning models and those relying more on heuristic inference. In contrast, the open-source models ALLaM, Fanar, LLaMA, and Mistral scored below 50%. This gap highlights how difficult it is for these models to adapt to, and reason within, highly specialized legal domains.
To understand these disparities, the researchers conducted a detailed error analysis on a subset of incorrectly answered questions. Errors were broadly categorized as “foundational” or “complex.” Foundational errors included misinterpreting the problem statement (comprehension errors), incorrectly applying normative legal rules (e.g., misclassifying heirs or misapplying exclusion rules), and making basic computational mistakes. Complex errors involved failures in the advanced mathematical operations needed for estate division, such as correction of the distribution denominator (taṣḥīḥ), redistribution (radd), and proportional reduction (‘awl). Models also struggled to resolve exceptional or disputed legal cases, such as those involving intersex individuals or specific juristic disagreements.
Open-source models frequently exhibited foundational errors. A particularly concerning finding was the tendency of some models to generate fabricated Quranic verses or prophetic narrations to justify their answers, which is a serious issue in religious contexts. These models also struggled with correctly identifying fixed shares for primary heirs and often either failed to recognize rightful heirs or erroneously included individuals not mentioned in the scenario. Basic arithmetic errors were also common, even when the models correctly identified the relevant legal rules.
Commercial models, particularly Gemini, showed strong capabilities in understanding inheritance questions, accurately interpreting familial relationships, and correctly applying fixed-share rules, often supported by appropriate scriptural references. However, even these advanced models had limitations. Gemini, for instance, occasionally struggled with distinctions between Islamic legal schools (madhhabs), sometimes applying Shāfiʿī jurisprudence when the Mālikī position was specifically required. It also had difficulty with complex scenarios, such as distinguishing inheritance cases in which an heir converted to Islam before the deceased’s death from those in which the conversion came after.
As for complex errors, open-source models consistently failed in scenarios requiring proportional reduction (‘awl) and redistribution (radd). They often miscalculated the distribution denominator or became trapped in calculation loops, indicating a poor grasp of the sequential steps that inheritance law requires. Gemini, while generally more proficient, applied distribution denominator correction inconsistently and occasionally failed to redistribute leftover shares. In cases involving juristic disagreement, all models tended to default to the majority opinion, likely due to biases in their training data, thereby overlooking valid minority views.
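If “distribution denominator correction” refers to taṣḥīḥ, the underlying arithmetic is: when a group's portions cannot be split evenly among its members, the base denominator is multiplied by the smallest number that makes every group's portions divisible. The sketch below computes that multiplier with a least common multiple. It is a mathematical shortcut, not the classical step-by-step procedure. The function name `corrected_base` and the example case are illustrative assumptions, and the sketch assumes that members within a group take equal portions.

```python
from math import gcd

def corrected_base(base, groups):
    """Return (new_base, multiplier) so every group divides evenly.

    base:   the current distribution denominator
    groups: list of (portions, heads) pairs, where `portions` is the
            number of parts out of `base` held by a group and `heads`
            is the number of heirs splitting those parts equally
    """
    multiplier = 1
    for portions, heads in groups:
        # Smallest factor that makes this group's portions divisible by its heads.
        factor = heads // gcd(portions, heads)
        # Least common multiple of the factors collected so far.
        multiplier = multiplier * factor // gcd(multiplier, factor)
    return base * multiplier, multiplier

# Hypothetical case: a wife with 1/8 and three sons sharing the remaining 7/8.
# Base 8 gives the wife 1 part and the sons 7 parts, which 3 sons cannot split.
new_base, multiplier = corrected_base(8, [(1, 1), (7, 3)])
wife_parts = 1 * multiplier
parts_per_son = 7 * multiplier // 3
print(new_base, wife_parts, parts_per_son)
```

Here the base rises from 8 to 24, giving the wife 3 parts and each son 7. Skipping this step, or picking the wrong multiplier, is the kind of denominator error described above.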
The study also underscored that simply achieving a correct answer is insufficient in legal contexts. An analysis of instances where lower-performing models produced correct answers revealed that their justifications often reflected the same foundational errors discussed earlier, highlighting a lack of robust underlying reasoning. This suggests that performance evaluations must account for the quality of reasoning, as accuracy alone can provide an incomplete and potentially misleading assessment of a model’s true capabilities in this domain.
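One way to make this distinction concrete is to report two numbers side by side: plain answer accuracy, and the fraction of questions where the answer is correct and the justification is also judged sound. The sketch below assumes each response has already been labeled, for example by a domain expert. The field names `answer_correct` and `reasoning_sound`, the function name `score`, and the sample labels are hypothetical and do not come from the paper.

```python
def score(graded):
    """Return (answer_accuracy, sound_accuracy) for labeled responses.

    graded: list of dicts with boolean keys 'answer_correct' and
            'reasoning_sound' (hypothetical expert labels).
    """
    n = len(graded)
    answer_acc = sum(g["answer_correct"] for g in graded) / n
    sound_acc = sum(
        g["answer_correct"] and g["reasoning_sound"] for g in graded
    ) / n
    return answer_acc, sound_acc

# Made-up labels for four responses.
graded = [
    {"answer_correct": True, "reasoning_sound": True},
    {"answer_correct": True, "reasoning_sound": False},  # right answer, flawed justification
    {"answer_correct": False, "reasoning_sound": False},
    {"answer_correct": True, "reasoning_sound": True},
]
answer_acc, sound_acc = score(graded)
print(answer_acc, sound_acc)
```

The gap between the two numbers (0.75 versus 0.5 in this toy example) is the share of correct answers that rest on faulty reasoning.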
In conclusion, the research demonstrates a substantial performance and reasoning gap between advanced commercial LLMs and open-source models when applied to Islamic inheritance law. The findings suggest that future research should prioritize agentic AI systems capable of step-by-step reasoning, precise adherence to legal rules, and flexible adaptation to complex inheritance cases. This will require high-quality, expert-guided datasets designed specifically to support and validate legal reasoning in this intricate domain. Further detail is available in the full research paper.


