Evaluating AI’s Ability to Generate Accurate Islamic Content

TLDR: This research uses a dual-agent framework to evaluate how faithfully GPT-4o, Ansari AI, and Fanar generate Islamic content. GPT-4o and Ansari AI performed relatively well on accuracy and style, but all three models struggled with reliable citation and contextual integrity. The study highlights the need for community-driven benchmarks and human oversight for AI in faith-sensitive domains.

Large language models (LLMs) are increasingly being used to provide Islamic guidance, but this comes with significant risks. These models can misquote religious texts, misapply Islamic law, or produce culturally inconsistent responses. A new study, “Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content,” investigates this issue.

The research, conducted by Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, and Junaid Qadir, piloted an evaluation of three prominent LLMs: GPT-4o, Ansari AI, and Fanar. The models were tested using prompts derived from authentic Islamic blogs, covering diverse topics such as Jurisprudence (Fiqh), Qur’anic Exegesis (Tafsir), Hadith Sciences (Ulum al-Hadith), Theology (Aqidah), and Spiritual Conduct (Adab).

A Dual-Agent Evaluation Framework

To thoroughly assess the LLMs, the researchers developed a unique dual-agent framework. This framework includes two main components:

  • Quantitative Agent: This agent is responsible for verifying citations and scoring the LLM-generated essays across six dimensions: Structural Coherence, Thematic Focus, Clarity, Originality, Islamic Accuracy, and Citation/Islamic Source Use. It uses verification tools to check Qur’anic verses, Hadiths, and other source texts, flagging references as confirmed, partially confirmed, unverified, or refuted.

  • Qualitative Agent: This agent performs a deeper, context-aware analysis through side-by-side comparisons of the LLM outputs. It evaluates responses across five dimensions: Clarity & Structure, Islamic Accuracy, Tone & Appropriateness, Depth & Originality, and Comparative Reflection. This agent highlights specific wording choices and rhetorical strategies, providing justification-driven assessments.
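The article does not describe how the Quantitative Agent stores its results, so the following is only a minimal Python sketch of one way to hold them. The six dimension names and the four citation statuses come from the description above; the `EssayEvaluation` class, its field names, the unweighted average, and the example scores are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean

# The six quantitative dimensions named in the article.
DIMENSIONS = (
    "Structural Coherence",
    "Thematic Focus",
    "Clarity",
    "Originality",
    "Islamic Accuracy",
    "Citation/Islamic Source Use",
)


class CitationStatus(Enum):
    """The four verification outcomes named in the article."""
    CONFIRMED = "confirmed"
    PARTIALLY_CONFIRMED = "partially confirmed"
    UNVERIFIED = "unverified"
    REFUTED = "refuted"


@dataclass
class EssayEvaluation:
    """Hypothetical record for one scored essay (not the authors' schema)."""
    model: str
    scores: dict  # dimension name -> score, assumed to top out at 5
    citations: list = field(default_factory=list)  # CitationStatus values

    def __post_init__(self):
        # Reject records that leave out any of the six dimensions.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")

    def overall(self):
        """Unweighted mean over the six dimensions (an assumption)."""
        return mean(self.scores[d] for d in DIMENSIONS)

    def confirmed_share(self):
        """Fraction of citations flagged as fully confirmed."""
        if not self.citations:
            return 0.0
        hits = sum(1 for c in self.citations if c is CitationStatus.CONFIRMED)
        return hits / len(self.citations)


# Illustrative values only; these are not results from the study.
example = EssayEvaluation(
    model="example-model",
    scores={**{d: 4 for d in DIMENSIONS}, "Originality": 3},
    citations=[CitationStatus.CONFIRMED, CitationStatus.REFUTED],
)
print(example.overall(), example.confirmed_share())
```

Keeping the citation flags separate from the dimension scores lets an essay that reads well but cites poorly show up as such, rather than being averaged away.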

Key Findings on LLM Performance

The study revealed a clear performance hierarchy among the evaluated models:

  • GPT-4o: Achieved the highest overall mean quantitative score (3.90 out of 5) and demonstrated the lowest variability in its responses. It particularly excelled in Islamic Accuracy (3.93) and Citation (3.38), as well as in stylistic dimensions such as Thematic Focus and Structural Coherence. Qualitatively, GPT-4o showed strength in Tone & Appropriateness and Depth & Originality.

  • Ansari AI: Followed closely with an average quantitative score of 3.79. It performed very similarly to GPT-4o in Islamic Accuracy and Citation. In the qualitative assessment, Ansari AI received the most “Best” verdicts, indicating strong performance in clarity, religious fidelity, and depth.

  • Fanar: Trailed with an average quantitative score of 3.04 and showed greater fluctuation in its performance. It struggled particularly in Originality, Islamic Accuracy, and Citation. Qualitatively, Fanar received the most “Worst” verdicts across multiple dimensions, highlighting challenges in linguistic and theological aspects. However, the study notes that Fanar introduces innovations for Islamic and Arabic contexts, suggesting potential for improvement with scaling.
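To put the three reported averages side by side, the short Python sketch below ranks the models and computes each one's gap to the top scorer. Only the three overall means come from the article; the variable names are illustrative, and no per-dimension or per-prompt data is assumed.

```python
# Overall mean quantitative scores (out of 5) as reported in the article.
reported_means = {"GPT-4o": 3.90, "Ansari AI": 3.79, "Fanar": 3.04}

# Sort models from highest to lowest reported mean.
ranking = sorted(reported_means.items(), key=lambda kv: kv[1], reverse=True)
top_model, top_score = ranking[0]

# Gap between each model and the top scorer, rounded to avoid float noise.
gaps = {name: round(top_score - score, 2) for name, score in ranking}

for name, score in ranking:
    print(f"{name}: {score:.2f} (gap to {top_model}: {gaps[name]:.2f})")
```

On these figures, Ansari AI sits 0.11 points behind GPT-4o, while Fanar trails by 0.86 points.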

Despite the relatively strong performance of GPT-4o and Ansari AI, a significant finding was that all models still fell short of reliably producing accurate Islamic content and citations. Such reliability is a critical requirement for faith-sensitive writing, where even minor errors can lead to misinformation or harm.

Implications and Future Directions

The research underscores the urgent need for community-driven benchmarks that center Muslim perspectives in evaluating AI for Islamic knowledge. The framework proposed in this study offers an early step toward more reliable AI in high-stakes domains, including medicine, law, and journalism, which also demand high levels of truthfulness and contextual integrity.

The authors suggest future work should address evaluator bias by using a diverse ensemble of evaluator LLMs, expand beyond the current 50 prompts to include more diverse cases and multilingual validation, and involve multi-expert human validation panels of Islamic scholars. Ultimately, responsible use of LLMs in faith-sensitive contexts will require clear disclaimers, mandatory scholar oversight, and evaluation methods that reflect diverse Islamic perspectives, ensuring AI assists rather than replaces human religious scholarship.

Rhea Bhattacharya (https://blogs.edgentiq.com)
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
