
AI’s Sacred Challenge: Evaluating Language Models for Islamic Jurisprudence

TLDR: A new study introduces FiqhQA, a benchmark to evaluate Large Language Models (LLMs) on Islamic legal questions across four Sunni schools of thought in Arabic and English. It assesses both accuracy and the ability of LLMs to abstain from answering when uncertain. Findings show GPT-4o is most accurate in English but less likely to abstain, leading to confident errors. Gemini and Fanar exhibit better abstention, especially in Arabic. All models perform worse in Arabic, highlighting language limitations and the need for cautious deployment and expert involvement in religious AI.

Large Language Models (LLMs) are increasingly used for answering questions across many fields, but their reliability in sensitive areas like religious guidance has remained largely unexplored. Many individuals now turn to these AI systems for answers on Islamic law and practice. However, traditional Islamic rulings, known as Fatwas, are typically issued by highly trained scholars through a rigorous process, often grounded within one of the four major Sunni schools of thought: Hanafi, Maliki, Shafi‘i, and Hanbali. Each school has distinct legal methodologies, ensuring authenticity and consistency. This context makes it crucial for LLMs to not only answer accurately but also to be aware of these jurisprudential differences.

The Challenge of Religious Question Answering for AI

Previous efforts to develop Islamic Question Answering (QA) systems have faced several limitations. Most existing datasets categorize questions by topic but fail to consider the distinctions between these four widely followed schools of thought. Furthermore, while many studies have focused on fine-tuning models for Islamic QA, little attention has been paid to assessing LLMs’ ability to abstain from answering when they are unsure. This “abstention behavior” is vital, as LLMs can sometimes “hallucinate” or provide confidently incorrect answers, posing significant risks in high-stakes domains like religion. As a popular proverb among jurists states, “Whomsoever says: ‘I don’t know’ has amassed half of knowledge.”

Introducing FiqhQA: A New Benchmark

To address these gaps, a recent research paper introduces a novel benchmark called FiqhQA. This dataset is specifically designed to evaluate LLMs on Islamic rulings, explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. The dataset comprises 960 question-answer pairs: 120 unique questions, each posed in both Arabic and English, with one answer per school of thought (120 × 4 × 2 = 960). The primary source for this dataset was the authoritative Kuwaiti Fiqh Encyclopedia, known for its systematic organization and clear presentation of differing positions among the schools. GPT-4o was used to generate initial question-answer pairs, which were then meticulously reviewed and validated by human experts, including native Arabic speakers and bilingual annotators who verified the English translations.
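The composition described above (120 questions, each with one answer per school in two languages) lends itself to a simple record-per-answer layout. The sketch below is purely illustrative; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one FiqhQA entry; the real
# dataset schema may differ -- field names here are illustrative.
@dataclass
class FiqhQARecord:
    question_id: int  # 1..120, shared across schools and languages
    language: str     # "ar" or "en"
    school: str       # "Hanafi", "Maliki", "Shafi'i", or "Hanbali"
    question: str
    answer: str       # the school-specific ruling

SCHOOLS = ["Hanafi", "Maliki", "Shafi'i", "Hanbali"]
LANGUAGES = ["ar", "en"]

# 120 questions x 4 schools x 2 languages = 960 question-answer pairs
total_pairs = 120 * len(SCHOOLS) * len(LANGUAGES)
print(total_pairs)  # 960
```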

Evaluating LLM Performance and Abstention

The study conducted two main types of experiments: zero-shot QA and abstention. For the zero-shot experiments, LLMs were evaluated on their ability to answer questions directly from the FiqhQA dataset. For abstention experiments, a subset of questions was used with specific prompts designed to encourage the models to say “I don’t know” if they were uncertain. Two variants of these prompts were used: a “basic” and a “strict” version, with the strict version including additional warnings about the consequences of incorrect answers. Six different LLMs were tested, including closed-source models like GPT-4o and Gemini 2.0 Flash, and open-source models like Fanar, Allam, Aya-Expanse-8B, and Gemma-2-9B-IT. The evaluation involved both automated assessment using GPT-4o as a judge and human cross-validation to ensure accuracy and consistency.
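The two-variant abstention setup described above can be sketched as follows. The prompt wording here is illustrative only (the paper's exact prompts are not reproduced), and the refusal-detection rule is a simplifying assumption:

```python
# Sketch of the "basic" vs. "strict" abstention-prompt variants.
# The wording is illustrative, not the paper's actual prompts.

BASIC_ABSTAIN = (
    "Answer the following question according to the {school} school "
    "of thought. If you are not sure of the answer, reply exactly "
    "with: I don't know."
)

STRICT_ABSTAIN = BASIC_ABSTAIN + (
    " Remember that an incorrect religious ruling can seriously "
    "mislead the questioner, so abstain unless you are certain."
)

def build_prompt(question: str, school: str, strict: bool = False) -> str:
    """Compose the evaluation prompt for one question and school."""
    template = STRICT_ABSTAIN if strict else BASIC_ABSTAIN
    return template.format(school=school) + "\n\nQuestion: " + question

def is_abstention(answer: str) -> bool:
    # Treat any response containing the refusal phrase as an abstention.
    return "i don't know" in answer.lower()
```

A model's raw responses would then be routed through `is_abstention` before scoring, so that refusals are counted separately from wrong answers, for example `is_abstention("I don't know.")` returns `True`.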

Key Findings: Accuracy, Abstention, and Language Barriers

The results revealed significant variations across models and languages. In English, GPT-4o emerged as the top performer, with 46% of its answers being fully correct. It showed particular strength in answering questions related to the Hanafi school of thought. Gemini 2.0 Flash, while trailing in overall accuracy, had the lowest percentage of completely wrong answers. However, all models exhibited a noticeable performance drop when answering questions in Arabic, suggesting that even advanced LLMs are more reliable in English for complex jurisprudential reasoning, likely due to the predominance of English in their training data.

Regarding abstention, Gemini and Fanar demonstrated superior capabilities, especially for Arabic questions. Gemini, for instance, showed an abstention rate of 90% with the basic prompt, indicating a more conservative and reliable strategy. GPT-4o, in contrast, was less likely to abstain, often producing incorrect answers with high confidence. The “strict” abstention prompts generally improved the abstention behavior of Fanar and Gemini, while GPT-4o still abstained less frequently compared to the other models.
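As a rough illustration of how the abstention figures above might be quantified (the paper's exact metrics may differ), one can compute an abstention rate alongside the accuracy over only the questions a model chose to answer:

```python
def abstention_stats(responses):
    """responses: list of (abstained: bool, correct: bool) per question.
    Returns (abstention_rate, selective_accuracy), where selective
    accuracy is measured only over the answered questions."""
    n = len(responses)
    abstained = sum(1 for a, _ in responses if a)
    answered = [(a, c) for a, c in responses if not a]
    rate = abstained / n if n else 0.0
    sel_acc = (
        sum(1 for _, c in answered if c) / len(answered)
        if answered else None
    )
    return rate, sel_acc

# E.g. a model that abstains on 9 of 10 questions (like Gemini's 90%
# rate with the basic prompt) and answers the remaining one correctly:
rate, sel_acc = abstention_stats([(True, False)] * 9 + [(False, True)])
print(rate, sel_acc)  # 0.9 1.0
```

This framing makes the trade-off in the findings concrete: a conservative model trades coverage for fewer confident errors.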


Implications and Future Directions

These findings highlight several critical insights. Firstly, the varying accuracy across different schools of thought, particularly GPT-4o’s better performance on Hanafi-related questions, suggests a bias in training data, likely due to the greater availability of Hanafi material online. This points to a need for more diverse and balanced training data for LLMs in religious domains. Secondly, the consistent performance drop in Arabic underscores the limitations of current multilingual LLMs in handling nuanced religious reasoning in languages other than English.

The study also emphasizes that despite improvements in abstention, LLMs are fundamentally probabilistic language models, not knowledge-grounded reasoners. This means they generate responses based on likelihood, not absolute certainty, leading to persistent incorrect answers. The paper argues that religious reasoning, or “ijtihad,” is inherently a human endeavor, requiring sincerity and effort that AI cannot replicate. Therefore, AI systems in this domain should not aim to replace human scholars but rather to work alongside them, with domain experts embedded in the design and evaluation process.

Crucially, accountability and transparency are paramount. Users must be made aware that LLM outputs are probabilistic and not guaranteed to be accurate, preventing over-reliance. Future work should focus on refining abstention mechanisms, increasing the diversity of training data to ensure equitable coverage of all Islamic legal traditions, and improving the interpretability and trustworthiness of these systems in multilingual religious contexts. For more details, you can refer to the full research paper: Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
