TLDR: This research evaluates how faithfully GPT-4o, Ansari AI, and Fanar generate Islamic content, using a dual-agent framework. The study found that while GPT-4o and Ansari AI performed relatively well in accuracy and style, all models struggled with reliable citation and contextual integrity. It highlights the need for community-driven benchmarks and human oversight for AI in faith-sensitive domains.
Large language models (LLMs) are increasingly being used to provide Islamic guidance, but this comes with significant risks. These models can misquote religious texts, incorrectly apply Islamic law, or produce responses that are culturally inconsistent. A new study titled “Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content” investigates this issue.
The research, conducted by Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, and Junaid Qadir, piloted an evaluation of three prominent LLMs: GPT-4o, Ansari AI, and Fanar. The models were tested using prompts derived from authentic Islamic blogs, covering diverse topics such as Jurisprudence (Fiqh), Qur’anic Exegesis (Tafsir), Hadith Sciences (Ulum al-Hadith), Theology (Aqidah), and Spiritual Conduct (Adab).
A Dual-Agent Evaluation Framework
To thoroughly assess the LLMs, the researchers developed a unique dual-agent framework. This framework includes two main components:
- Quantitative Agent: This agent is responsible for verifying citations and scoring the LLM-generated essays across six dimensions: Structural Coherence, Thematic Focus, Clarity, Originality, Islamic Accuracy, and Citation/Islamic Source Use. It uses verification tools to check Qur’anic verses, Hadiths, and other source texts, flagging references as confirmed, partially confirmed, unverified, or refuted.
- Qualitative Agent: This agent performs a deeper, context-aware analysis through side-by-side comparisons of the LLM outputs. It evaluates responses across five dimensions: Clarity & Structure, Islamic Accuracy, Tone & Appropriateness, Depth & Originality, and Comparative Reflection. This agent highlights specific wording choices and rhetorical strategies, providing justification-driven assessments.
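To make the quantitative agent's output more concrete, here is a minimal Python sketch of how one essay's evaluation record might be represented. It is not the authors' implementation. The dimension names and the four citation statuses come from the description above; the class and function names, the 0 to 5 score range, the equal weighting of dimensions, and the use of standard deviation as the measure of variability are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean, stdev


class CitationStatus(Enum):
    """The four verification outcomes named in the article."""
    CONFIRMED = "confirmed"
    PARTIALLY_CONFIRMED = "partially confirmed"
    UNVERIFIED = "unverified"
    REFUTED = "refuted"


# The six quantitative dimensions named in the article.
DIMENSIONS = (
    "Structural Coherence",
    "Thematic Focus",
    "Clarity",
    "Originality",
    "Islamic Accuracy",
    "Citation/Islamic Source Use",
)


@dataclass
class EssayEvaluation:
    """Hypothetical record for one LLM-generated essay."""
    scores: dict[str, float]               # dimension name -> score
    citations: dict[str, CitationStatus]   # cited reference -> verification status

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")
        # Assumption: scores lie between 0 and 5 (the article only says "out of 5").
        for name, value in self.scores.items():
            if not 0 <= value <= 5:
                raise ValueError(f"{name} score out of range: {value}")

    def overall(self) -> float:
        """Unweighted mean across the six dimensions (equal weights are an assumption)."""
        return mean(self.scores[d] for d in DIMENSIONS)

    def flagged(self) -> list[str]:
        """References that were not fully confirmed and may need human review."""
        return [ref for ref, status in self.citations.items()
                if status is not CitationStatus.CONFIRMED]


def summarize(evaluations: list[EssayEvaluation]) -> tuple[float, float]:
    """Mean and sample standard deviation of overall scores (needs at least two essays)."""
    totals = [e.overall() for e in evaluations]
    return mean(totals), stdev(totals)
```

Under these assumptions, a model's overall mean and variability could be obtained by calling `summarize` on its per-prompt evaluations, while `flagged` surfaces the citations a human reviewer would need to check.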
Key Findings on LLM Performance
The study revealed a clear performance hierarchy among the evaluated models:
- GPT-4o: Achieved the highest overall mean quantitative score (3.90 out of 5) and demonstrated the lowest variability in its responses. It particularly excelled in Islamic Accuracy (3.93) and Citation (3.38), as well as in stylistic elements like Theme and Structure. Qualitatively, GPT-4o showed strength in Tone & Appropriateness and Depth & Originality.
- Ansari AI: Followed closely with an average quantitative score of 3.79. It performed very similarly to GPT-4o in Islamic Accuracy and Citation. In the qualitative assessment, Ansari AI received the most “Best” verdicts, indicating strong performance in clarity, religious fidelity, and depth.
- Fanar: Trailed with an average quantitative score of 3.04 and showed greater fluctuation in its performance. It struggled particularly in Originality, Islamic Accuracy, and Citation. Qualitatively, Fanar received the most “Worst” verdicts across multiple dimensions, highlighting challenges in linguistic and theological aspects. However, the study notes that Fanar introduces innovations for Islamic and Arabic contexts, suggesting potential for improvement with scaling.
Despite the relatively strong performance of GPT-4o and Ansari AI, a key finding is that all models still fall short of reliably producing accurate Islamic content and citations. This reliability is a critical requirement for faith-sensitive writing, where even minor errors can lead to misinformation or harm.
Implications and Future Directions
The research underscores the urgent need for community-driven benchmarks that center Muslim perspectives in evaluating AI for Islamic knowledge. The framework proposed in this study offers an early step toward more reliable AI in high-stakes domains, including medicine, law, and journalism, which also demand high levels of truthfulness and contextual integrity.
The authors suggest future work should address evaluator bias by using a diverse ensemble of evaluator LLMs, expand beyond the current 50 prompts to include more diverse cases and multilingual validation, and involve multi-expert human validation panels of Islamic scholars. Ultimately, responsible use of LLMs in faith-sensitive contexts will require clear disclaimers, mandatory scholar oversight, and evaluation methods that reflect diverse Islamic perspectives, ensuring AI assists rather than replaces human religious scholarship.


