TLDR: A new rule-based morphological synthesizer for the ancient Ge’ez language has been developed, achieving 97.4% accuracy in generating words from roots. This pioneering work addresses the language’s complex morphology and lack of digital resources, offering the first publicly available datasets and an algorithm for Ge’ez word formation, crucial for its preservation and future NLP applications.
Ge’ez is an ancient Semitic language with its own script, and it holds significant cultural and religious importance in Ethiopia and Eritrea. The Ge’ez script is used to write languages such as Tigrinya and Amharic, and the language itself was central to the Aksumite kingdom. Although Ge’ez remains in liturgical use, it is poorly served by Natural Language Processing (NLP): its morphology is complex, and annotated linguistic data, corpora, and lexicons are severely scarce. This lack of resources has hindered the development of usable NLP tools for Ge’ez.
To address these limitations, researchers Gebrearegawi Gebremariam, Hailay Teklehaymanot, and Gebregewergs Mezgebe proposed a rule-based Ge’ez morphological synthesizer. The system automatically generates surface words from root words by applying the morphological rules of the language. The authors describe the project as a pioneering effort, reporting that no prior research had produced an automatic morphological generator for Ge’ez.
System Design and Methodology
The core of the proposed system is its rule-based approach, specifically the Two-Level Model (TLM) of morphology. This model suits low-resource languages like Ge’ez because its rules are formulated from expert linguistic knowledge rather than learned from large annotated corpora, which allows faster development and good accuracy even when little data is available. The synthesizer’s design has five key components: a Stem Classifier that identifies verb categories and regularity, a Stem Formation component that generates derived stems, a Signature Builder that matches stems with valid affixes, a Boundary Change Handler that manages spelling changes when morphemes are concatenated, and the Synthesizer itself, which generates all possible surface word forms.
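As a rough illustration of how the five components might fit together, here is a minimal Python sketch. It is not the authors' implementation and it does not implement the finite-state rules of the Two-Level Model. Every name in it (`classify_stem`, `form_stems`, `build_signature`, `handle_boundary`, `synthesize`) is hypothetical, and the root, templates, affixes, and boundary rule are invented toy material in Latin letters, not real Ge’ez morphology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affix:
    """A prefix/suffix pair that may attach to a stem (toy data)."""
    prefix: str
    suffix: str


# Invented toy templates: digits are positions in the root, letters are
# inserted as-is. "1a22a3" doubles the second root consonant.
TEMPLATES = {"triliteral": ["1a2a3", "1a22a3"]}

# Invented toy affix sets ("signatures") per verb class.
SIGNATURES = {
    "triliteral": [Affix("", ""), Affix("", "a"), Affix("ya", ""), Affix("", "ku")],
}


def classify_stem(root: str) -> str:
    """Stem Classifier: assign a verb class (toy criterion: root length)."""
    return "triliteral" if len(root) == 3 else "unknown"


def form_stems(root: str, verb_class: str) -> list[str]:
    """Stem Formation: fill each template of the class with root consonants."""
    return [
        "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in template)
        for template in TEMPLATES.get(verb_class, [])
    ]


def build_signature(verb_class: str) -> list[Affix]:
    """Signature Builder: look up the affixes valid for this verb class."""
    return SIGNATURES.get(verb_class, [])


def handle_boundary(left: str, right: str) -> str:
    """Boundary Change Handler: toy rule that merges an identical letter
    meeting itself at a morpheme boundary (e.g. "samak" + "ku" -> "samaku")."""
    if left and right and left[-1] == right[0]:
        return left + right[1:]
    return left + right


def synthesize(root: str) -> list[str]:
    """Synthesizer: generate every stem/affix combination for a root."""
    verb_class = classify_stem(root)
    forms = []
    for stem in form_stems(root, verb_class):
        for affix in build_signature(verb_class):
            word = handle_boundary(handle_boundary(affix.prefix, stem), affix.suffix)
            forms.append(word)
    return forms


if __name__ == "__main__":
    for form in synthesize("smk"):
        print(form)
```

Keeping each stage as a separate function mirrors the component list above: a missing rule can be added to one table or one function without touching the others, which is the main practical appeal of a rule-based design.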
The researchers compiled what they describe as the first publicly available dataset for Ge’ez morphological synthesis, consisting of 1,102 sample verbs that represent all verb morphological structures. This dataset was used to test and evaluate the system. The evaluation combined manual assessment by language experts with automatic evaluation using predefined metrics. The system achieved an overall average accuracy of 97.4%. According to the authors, this surpasses the baseline models and shows that the rule-based TLM approach is effective for Ge’ez.
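The summary does not say exactly how the overall average was computed. One plausible reading, shown here only as an assumption, is a per-verb accuracy (the share of expected forms that the system generated) averaged over all verbs. The function names and the sample data are hypothetical.

```python
def form_accuracy(generated: list[str], expected: list[str]) -> float:
    """Share of expected surface forms that appear in the generated output."""
    if not expected:
        raise ValueError("expected forms must not be empty")
    return len(set(generated) & set(expected)) / len(set(expected))


def average_accuracy(results: list[tuple[list[str], list[str]]]) -> float:
    """Mean of per-verb accuracies over (generated, expected) pairs."""
    scores = [form_accuracy(generated, expected) for generated, expected in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented toy data: the first verb is fully correct, the second half correct.
    toy_results = [
        (["a", "b", "c", "d"], ["a", "b", "c", "d"]),
        (["a", "x"], ["a", "b"]),
    ]
    print(average_accuracy(toy_results))
```

Under this reading, every verb carries equal weight regardless of how many forms it has. Pooling all forms before dividing would weight verbs with larger paradigms more heavily and could give a different figure.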
Key Contributions and Future Outlook
The high performance is attributed to several factors, including the correct generation of stems, proper handling of rules during morpheme concatenation, and effective management of irregular verb formations, which are prevalent in Ge’ez. However, the study also identified areas for improvement, such as errors caused by exceptional characters in verbs, issues during the concatenation of certain words with affixes, and the inherent richness and varied nature of Ge’ez morphology. Some errors also stemmed from missing specific rules in the initial design.
This research makes fundamental contributions to the scientific community by providing an algorithm based on Ge’ez morphological rules, creating the first publicly available datasets, and offering Amharic and English meanings for perfect verb forms, which could spur the development of Ge’ez-Amharic or Ge’ez-Tigrinya dictionaries. The project underscores the importance of preserving the Ge’ez language, which is deeply intertwined with Ethiopia and Eritrea’s cultural and historical heritage.
For more detailed information, you can refer to the full research paper: Morphological Synthesizer for Ge’ez Language: Addressing Morphological Complexity and Resource Limitations.


