TLDR: A new post-processing technique called mucAI improves Arabic readability assessment by using conformal prediction to generate prediction sets with statistical guarantees. This method, which reduces high-penalty misclassifications, consistently boosts Quadratic Weighted Kappa (QWK) scores by 1-3 points across various models in the BAREC 2025 Shared Task, making predictions more reliable and interpretable for educational applications.
Assessing how difficult an Arabic text is for a target audience, known as readability assessment, is a crucial task for educational applications. However, the task poses unique challenges due to the rich morphology and varied orthography of the Arabic language. Even advanced models often make severe misclassifications and offer no principled way to quantify the uncertainty of their predictions.
A new approach, named mucAI, addresses these issues by introducing a simple, model-agnostic post-processing technique for fine-grained Arabic readability classification. This method was developed for the BAREC 2025 Shared Task, which involves classifying Arabic texts into 19 ordinal readability levels.
Understanding the mucAI Method
The core of the mucAI method lies in integrating conformal prediction. Instead of providing a single, definitive readability level (e.g., “Level 9”), conformal prediction generates a “prediction set” – a range of plausible levels (e.g., “Level 7, 8, 9, 10, or 11”). The key benefit here is that this set comes with statistical guarantees, meaning there’s a high probability that the true readability level is contained within this predicted range. This provides a principled way to quantify prediction uncertainty.
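As a concrete illustration, split conformal prediction for classification can be sketched in a few lines. The function below is a generic sketch, not the paper's exact implementation: it assumes access to softmax probabilities over the readability levels, uses one minus the true-class probability as the nonconformity score, and builds each prediction set from a finite-sample-corrected quantile of the calibration scores.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for multi-class classification.

    cal_probs:  (n_cal, K) softmax probabilities on a held-out calibration set
    cal_labels: (n_cal,)   true class indices for the calibration set
    test_probs: (n_test, K) softmax probabilities for new inputs
    alpha:      target miscoverage rate (0.1 gives roughly 90% coverage)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # Include every class whose score falls within the threshold.
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]
```

Intuitively, a confident input yields a small set (often a single level), while an ambiguous input yields a wider set, which is exactly what makes the set size a usable uncertainty signal.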
After generating these prediction sets, the mucAI method collapses each set back to a single level by computing a weighted average over the levels in the set, using the model’s probabilities as weights. This “uncertainty-aware decoding” improves overall agreement with the true labels by reducing severe misclassifications, particularly those where the predicted level lands far from the actual level. For instance, shrinking an error from four levels away to just one level away has a large effect on the evaluation metric.
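The decoding step can be sketched as follows. This is a minimal illustration of the weighted-average idea, assuming the set is collapsed by renormalising the model's probabilities inside the set and rounding the weighted mean to the nearest level; the paper's exact decoding rule may differ in detail.

```python
import numpy as np

def uncertainty_aware_decode(probs, pred_set):
    """Collapse a conformal prediction set to one ordinal level.

    probs:    (K,) softmax probabilities over the K readability levels
    pred_set: indices of the levels included in the conformal set

    Returns the probability-weighted average level within the set,
    rounded to the nearest integer level.
    """
    pred_set = np.asarray(pred_set)
    w = probs[pred_set]
    w = w / w.sum()  # renormalise the probabilities inside the set
    return int(round(float(np.dot(w, pred_set))))
```

Because the average is taken over the whole set, an isolated far-away level with modest probability only pulls the final prediction slightly, which is how large errors get shrunk into small ones.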
Performance and Impact
The mucAI approach consistently demonstrated improvements in the Quadratic Weighted Kappa (QWK) score, which is the primary evaluation metric for the BAREC Shared Task. QWK heavily penalizes larger misclassifications, making it a suitable metric for educational contexts where assigning a text far from a student’s actual reading level can be detrimental.
Across different base models, mucAI achieved consistent QWK improvements of 1-3 points. In the strict track of the BAREC 2025 Shared Task, the submission achieved sentence-level QWK scores of 84.9% on the test set and 85.7% on the blind test set; for document-level assessment, it reached 73.3%.
One interesting finding was that while the exact accuracy (predicting the precise level) might slightly decrease, the QWK score significantly improves. This is because the method effectively shrinks many large errors into smaller, less penalized ones. For example, changing a prediction that was four levels off to one that is only one level off drastically reduces the penalty due to the squared distance calculation in QWK.
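The arithmetic behind this is easy to verify. QWK penalizes predicting level j when the true level is i with the standard quadratic weight (i − j)² / (K − 1)², so on the 19-level BAREC scale a four-level miss costs 16 times as much as a one-level miss:

```python
def qwk_weight(i, j, k=19):
    """Quadratic penalty weight for predicting level j when the true level is i."""
    return (i - j) ** 2 / (k - 1) ** 2

far = qwk_weight(9, 13)   # four levels off -> 16/324
near = qwk_weight(9, 10)  # one level off   -> 1/324
# The far miss is 16x as costly as the near miss.
```

This is why trading a few exact matches for many fewer distant misses raises QWK even as exact accuracy dips.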
Practical Applications and Future Directions
Beyond improving leaderboard scores, the mucAI method offers clear practical value. By producing compact, interpretable prediction sets, it lets human reviewers in Arabic educational assessment focus on a handful of plausible readability levels, combining statistical guarantees with real-world applicability and making the assessment process more reliable and efficient.
The research also highlighted that the coverage reliability of the prediction sets can vary across different text domains, with Social Sciences texts showing higher failure rates compared to Arts & Humanities. This suggests potential for future work, such as using domain-adaptive calibration strategies to further enhance reliability.
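One simple form of domain-adaptive calibration, a direction the findings suggest rather than something implemented in the paper, is to compute a separate conformal threshold per text domain. A minimal sketch, assuming per-example domain labels and the same 1 − p_true nonconformity scores as before:

```python
import numpy as np

def per_domain_thresholds(scores, domains, alpha=0.1):
    """Compute one conformal quantile threshold per text domain.

    scores:  (n,) nonconformity scores (e.g. 1 - p_true) on calibration data
    domains: (n,) domain label for each calibration example
    """
    scores = np.asarray(scores)
    domains = np.asarray(domains)
    thresholds = {}
    for d in set(domains.tolist()):
        s = scores[domains == d]
        n = len(s)
        q = np.ceil((n + 1) * (1 - alpha)) / n
        thresholds[d] = np.quantile(s, min(q, 1.0), method="higher")
    return thresholds
```

A domain with systematically harder examples (higher nonconformity scores) then gets a larger threshold and wider prediction sets, restoring coverage where a single global threshold under-covers.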
For more in-depth details, you can refer to the full research paper: mucAI at BAREC Shared Task 2025: Towards Uncertainty Aware Arabic Readability Assessment.