Evaluating AI's Ability to Generate Accurate Islamic Content

TLDR: This research uses a dual-agent framework to evaluate how faithfully GPT-4o, Ansari AI, and Fanar generate Islamic content. GPT-4o and Ansari AI performed relatively well on accuracy and style, but all three models struggled with reliable citation and contextual integrity. The study highlights the need for community-driven benchmarks and human oversight for AI in faith-sensitive domains.

Large language models (LLMs) are increasingly being used to provide Islamic guidance, but this comes with significant risks. These models can misquote religious texts, incorrectly apply Islamic law, or produce responses that are culturally inconsistent. A new study titled “Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content” investigates this critical issue.

The research, conducted by Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, and Junaid Qadir, piloted an evaluation of three prominent LLMs: GPT-4o, Ansari AI, and Fanar. The models were tested using prompts derived from authentic Islamic blogs, covering diverse topics such as Jurisprudence (Fiqh), Qur’anic Exegesis (Tafsir), Hadith Sciences (Ulum al-Hadith), Theology (Aqidah), and Spiritual Conduct (Adab).

A Dual-Agent Evaluation Framework

To thoroughly assess the LLMs, the researchers developed a unique dual-agent framework. This framework includes two main components:

Quantitative Agent: This agent is responsible for verifying citations and scoring the LLM-generated essays across six dimensions: Structural Coherence, Thematic Focus, Clarity, Originality, Islamic Accuracy, and Citation/Islamic Source Use. It uses verification tools to check Qur’anic verses, Hadiths, and other source texts, flagging references as confirmed, partially confirmed, unverified, or refuted.
Qualitative Agent: This agent performs a deeper, context-aware analysis through side-by-side comparisons of the LLM outputs. It evaluates responses across five dimensions: Clarity & Structure, Islamic Accuracy, Tone & Appropriateness, Depth & Originality, and Comparative Reflection. This agent highlights specific wording choices and rhetorical strategies, providing justification-driven assessments.
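
To make the Quantitative Agent's output more concrete, the record it produces for one essay could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the class and method names (EssayEvaluation, CitationStatus, flagged_citations), the 1 to 5 score range, and the choice to average the six dimensions with equal weight are all assumptions. Only the dimension names and the four citation labels come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean


class CitationStatus(Enum):
    """The four verification labels named in the article."""
    CONFIRMED = "confirmed"
    PARTIALLY_CONFIRMED = "partially confirmed"
    UNVERIFIED = "unverified"
    REFUTED = "refuted"


# The six quantitative dimensions named in the article.
DIMENSIONS = (
    "Structural Coherence",
    "Thematic Focus",
    "Clarity",
    "Originality",
    "Islamic Accuracy",
    "Citation/Islamic Source Use",
)


@dataclass
class EssayEvaluation:
    """Hypothetical record for one LLM-generated essay."""
    model: str
    scores: dict[str, float]  # dimension -> score (assumed 1-5 scale)
    citations: dict[str, CitationStatus] = field(default_factory=dict)

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")
        for name, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} score {value} is outside 1-5")

    def overall(self) -> float:
        # Assumption: the overall score is the unweighted mean of the six dimensions.
        return mean(self.scores[d] for d in DIMENSIONS)

    def flagged_citations(self) -> list[str]:
        # Any reference that is not fully confirmed is surfaced for review.
        return [
            ref for ref, status in self.citations.items()
            if status is not CitationStatus.CONFIRMED
        ]


# Placeholder values for illustration; these are not results from the study.
example = EssayEvaluation(
    model="model-x",
    scores={d: 4.0 for d in DIMENSIONS},
    citations={
        "reference A": CitationStatus.CONFIRMED,
        "reference B": CitationStatus.UNVERIFIED,
    },
)
print(round(example.overall(), 2))
print(example.flagged_citations())
```

Keeping the citation labels separate from the numeric scores reflects the split described above: verification decides what is flagged, while the rubric decides how the essay is scored.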

Key Findings on LLM Performance

The study revealed a clear performance hierarchy among the evaluated models:

GPT-4o: Achieved the highest overall mean quantitative score (3.90 out of 5) and showed the lowest variability in its responses. Its Islamic Accuracy (3.93) and Citation (3.38) scores were the highest of the three models, and it also scored well on stylistic dimensions such as Thematic Focus and Structural Coherence. Qualitatively, GPT-4o showed strength in Tone & Appropriateness and Depth & Originality.
Ansari AI: Followed closely with an average quantitative score of 3.79. It performed very similarly to GPT-4o in Islamic Accuracy and Citation. In the qualitative assessment, Ansari AI received the most “Best” verdicts, indicating strong performance in clarity, religious fidelity, and depth.
Fanar: Trailed with an average quantitative score of 3.04 and showed greater fluctuation in its performance. It struggled particularly in Originality, Islamic Accuracy, and Citation. Qualitatively, Fanar received the most “Worst” verdicts across multiple dimensions, highlighting challenges in linguistic and theological aspects. However, the study notes that Fanar introduces innovations for Islamic and Arabic contexts, suggesting potential for improvement with scaling.
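
The size of the gaps between the models can be checked with a few lines of arithmetic. The snippet below is a small illustration that uses only the three overall mean scores reported above; the variable names are arbitrary.

```python
# Overall mean quantitative scores (out of 5) as reported in this article.
reported_means = {"GPT-4o": 3.90, "Ansari AI": 3.79, "Fanar": 3.04}

# Rank the models from highest to lowest mean score.
ranking = sorted(reported_means.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_score = ranking[0]

# Gap between each model and the top-ranked one, rounded to two decimals.
gaps = {name: round(leader_score - score, 2) for name, score in ranking}

for name, score in ranking:
    print(f"{name}: mean {score:.2f}, gap to {leader_name} {gaps[name]:.2f}")
```

On these figures Ansari AI sits 0.11 points behind GPT-4o, while Fanar is 0.86 points behind, which is why the first two are described as close and the third as trailing.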

Despite the relatively strong performance of GPT-4o and Ansari AI, the study found that all three models still fall short of reliably producing accurate Islamic content and citations. That reliability is a critical requirement for faith-sensitive writing, where even minor errors can lead to misinformation or harm.

Implications and Future Directions

The research underscores the urgent need for community-driven benchmarks that center Muslim perspectives in evaluating AI for Islamic knowledge. The framework proposed in this study also offers an early step toward more reliable AI in other high-stakes domains, such as medicine, law, and journalism, which likewise demand high levels of truthfulness and contextual integrity.

The authors suggest future work should address evaluator bias by using a diverse ensemble of evaluator LLMs, expand beyond the current 50 prompts to include more diverse cases and multilingual validation, and involve multi-expert human validation panels of Islamic scholars. Ultimately, responsible use of LLMs in faith-sensitive contexts will require clear disclaimers, mandatory scholar oversight, and evaluation methods that reflect diverse Islamic perspectives, ensuring AI assists rather than replaces human religious scholarship.
