TLDR: HeQ is a new, large, and diverse Hebrew Machine Reading Comprehension (MRC) benchmark dataset. It addresses the challenges of extractive Question Answering in morphologically rich languages like Hebrew by introducing novel annotation guidelines and a new evaluation metric called TLNLS. Experiments with HeQ show that multilingual models perform surprisingly well on Hebrew MRC tasks and emphasize the critical role of data quality and diversity in improving model performance for natural language understanding in Hebrew.
For years, the field of Natural Language Processing (NLP) for Hebrew has primarily concentrated on understanding the grammatical structure of the language. However, a significant gap remained in evaluating how well machines truly comprehend the meaning of Hebrew text. This challenge is particularly complex for Hebrew, a Morphologically Rich Language (MRL), where words can combine in many ways, making it difficult to pinpoint exact answer spans in text.
A new research paper introduces HeQ, a groundbreaking benchmark designed to bridge this gap. HeQ stands for Hebrew Question Answering, and it aims to provide a robust dataset for Machine Reading Comprehension (MRC) in Hebrew, specifically focusing on extractive Question Answering.
The Unique Challenges of Hebrew
Hebrew’s rich morphology means that a single word can carry as much information as several words in languages like English, which makes it hard to identify precise answer spans. For instance, “in the house” is a single Hebrew word (בבית), formed by attaching a prefix to the word for “house” (בית), so two equally correct answer spans may differ only by that prefix. Traditional evaluation metrics, like F1 Score and Exact Match (EM), which work well for English, often penalize models unfairly in Hebrew because they don’t account for these affixations (prefixes, suffixes, clitics).
These standard metrics can give a score of zero for a perfectly valid answer whose boundaries differ slightly due to Hebrew’s fused word structure, even when the meaning is entirely correct. This systematic underestimation of model performance has been a significant hurdle.
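To make the problem concrete, here is a minimal sketch of the standard SQuAD-style metrics. The Hebrew strings are the example from above: בבית (“in the house”) is a single token containing the answer בית (“house”) plus a one-letter prefix, yet both metrics score the pair at zero.

```python
# Sketch of why standard SQuAD-style metrics under-score Hebrew answers.
# exact_match and token_f1 follow the usual whitespace-token definitions.

def exact_match(pred: str, gold: str) -> float:
    return float(pred.split() == gold.split())

def token_f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = pred.split(), gold.split()
    # Count overlapping tokens (multiset intersection).
    common, gold_pool = 0, list(gold_toks)
    for t in pred_toks:
        if t in gold_pool:
            gold_pool.remove(t)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# A prediction differing from the gold span only by a prefixed
# preposition scores zero under both metrics, despite being correct.
print(exact_match("בבית", "בית"))  # 0.0
print(token_f1("בבית", "בית"))     # 0.0
```

Because whole tokens either match or they don’t, the one-character prefix wipes out all credit for the answer.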
Introducing HeQ: A New Standard for Hebrew MRC
To overcome these challenges, the creators of HeQ developed a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics tailored for morphologically rich languages. The resulting HeQ benchmark boasts 30,147 diverse question-answer pairs. These pairs are sourced from two distinct domains: Hebrew Wikipedia articles, offering a wide range of topics, and Israeli tech news from Geektime, providing varied text structures.
The dataset was meticulously created with four core principles: Diversity, Accuracy, Difficulty, and Quality over Quantity. Annotators were carefully selected and monitored, and questions were designed to require inference rather than simple word matching, making the benchmark more challenging and robust.
A New Metric for MRLs: TLNLS
One of HeQ’s most significant contributions is the proposal of a new evaluation metric: Token-Level Normalized Levenshtein Similarity (TLNLS). Unlike F1 or EM, TLNLS is less sensitive to minor changes in answer span boundaries caused by Hebrew’s unique morphology. It provides a more accurate reflection of a model’s performance by giving similar scores to words and their inflected variants, and to different spans representing the same answer. This language-independent metric is crucial for fairly assessing MRC models in Hebrew and other MRLs.
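To illustrate the idea, here is a minimal sketch of a token-level normalized Levenshtein similarity in the spirit of TLNLS. The exact formulation in the HeQ paper may differ; the aggregation below (greedy best-match per token, combined F1-style) is an assumption made for illustration.

```python
# Minimal sketch of a token-level normalized Levenshtein similarity.
# NOTE: the aggregation scheme here is illustrative, not the paper's
# exact TLNLS definition.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def token_sim(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means identical tokens.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def tlnls_sketch(pred: str, gold: str) -> float:
    pred_toks, gold_toks = pred.split(), gold.split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    # Each token gets partial credit for its closest counterpart,
    # then the two directions are combined like precision/recall.
    precision = sum(max(token_sim(p, g) for g in gold_toks)
                    for p in pred_toks) / len(pred_toks)
    recall = sum(max(token_sim(g, p) for p in pred_toks)
                 for g in gold_toks) / len(gold_toks)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The prefix-only mismatch from earlier now earns partial credit
# instead of zero:
print(tlnls_sketch("בבית", "בית"))  # 0.75
```

The key design point is character-level comparison inside each token: an answer that differs from the gold span only by an affix is close in edit distance, so it scores high instead of zero.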
Key Findings and Insights
Experiments conducted using HeQ revealed several surprising insights:
- **Multilingual Models Excel:** mBERT, a multilingual model, outperformed all Hebrew-trained models, despite having been exposed to less Hebrew data during its pretraining. This suggests that pretraining on a large corpus of diverse languages can significantly benefit Hebrew NLP models.
- **Quality Over Quantity:** The research highlighted that the quality and diversity of the data are as crucial as the sheer size of the dataset. Models trained on HeQ performed significantly better than those trained on the earlier ParaShoot dataset, even when using a smaller subset of HeQ.
- **Domain Impact:** Models trained on the Geektime (news) section of HeQ showed better performance and domain transferability compared to those trained on Wikipedia articles, likely due to the news domain’s more varied text structure.
In conclusion, HeQ marks a significant step forward for Hebrew Natural Language Understanding. It provides a high-quality, diverse benchmark and introduces a more appropriate evaluation metric, paving the way for the development of more sophisticated and accurate NLP models for Hebrew and other morphologically rich languages. You can explore the research paper in detail here: HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark.


