TLDR: OmniGEC is a new collection of silver-standard multilingual datasets for Grammatical Error Correction (GEC), covering 11 languages including Czech, English, and Ukrainian. It combines human-edited Wikipedia texts with Reddit and UberText 2.0 texts whose corrections were generated automatically with GPT-4o-mini. Experiments show that fine-tuning large language models such as Aya-Expanse and Gemma-3 on OmniGEC significantly improves performance and achieves state-of-the-art results for paragraph-level multilingual GEC, with the larger Gemma-3 model benefiting most.
Grammatical Error Correction (GEC) is a crucial task in Natural Language Processing (NLP) that identifies and rectifies grammatical mistakes in written text. While significant strides have been made in GEC for high-resource languages like English, its development for multilingual contexts, especially for underrepresented languages such as Ukrainian, Czech, and Slovene, has faced challenges due to a lack of high-quality datasets.
To address this gap, researchers Roman Kovalchuk, Mariana Romanyshyn, and Petro Ivaniuk have introduced OmniGEC, a collection of multilingual silver-standard datasets designed specifically for GEC. The collection covers eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Slovene, Swedish, and Ukrainian. OmniGEC aims to support the development of robust multilingual GEC solutions and narrow the existing data gap.
The Genesis of OmniGEC: Data Sources and Correction Methods
The texts within the OmniGEC datasets originate from three distinct sources. The first, WikiEdits-MultiGEC, comprises human-made corrections derived from Wikipedia edits across the eleven target languages. These edits, specifically from the ‘newcomer task copyedit’ category, represent genuine human efforts to correct small grammatical mistakes.
The other two sources, Reddit-MultiGEC and UberText-GEC, rely on automatically corrected data. Reddit-MultiGEC is a large multilingual corpus of posts scraped from Reddit subreddits, while UberText-GEC is a Ukrainian-only social media corpus. For these datasets, corrections were generated synthetically with the GPT-4o-mini model in three steps: first, GEC instructions and few-shot prompts were produced for each language using DeepL for translation and the o1-preview model; second, GPT-4o-mini generated three candidate corrections for each text sample; third, GPT-4o-mini aggregated those candidates into a single, more complete correction.
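The generate-then-aggregate part of this process can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the `llm` callable, the function names, and the prompt wording are all hypothetical stand-ins, and the real pipeline would call GPT-4o-mini with the language-specific instructions and few-shot prompts prepared in the first step.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to the model's reply.
LLM = Callable[[str], str]


def generate_candidates(text: str, instruction: str, llm: LLM, n: int = 3) -> List[str]:
    """Step 2: request n independent corrections of the same text."""
    prompt = f"{instruction}\n\nText:\n{text}\n\nCorrected text:"
    return [llm(prompt).strip() for _ in range(n)]


def aggregate(text: str, candidates: List[str], llm: LLM) -> str:
    """Step 3: ask the model to merge the candidates into one correction."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, start=1))
    # Prompt wording is an assumption made for this sketch.
    prompt = (
        "Merge the candidate corrections below into a single corrected text "
        "that keeps every valid fix.\n\n"
        f"Original:\n{text}\n\nCandidates:\n{numbered}\n\nMerged correction:"
    )
    return llm(prompt).strip()


def correct(text: str, instruction: str, llm: LLM, n: int = 3) -> str:
    """Run steps 2 and 3 for a single text sample."""
    candidates = generate_candidates(text, instruction, llm, n)
    return aggregate(text, candidates, llm)
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library; in practice `llm` would wrap whatever API call is used to reach the model.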
Evaluating Quality and Performance
The quality of corrections within the OmniGEC datasets underwent rigorous evaluation, combining both automated metrics and human feedback. Due to practical constraints, human evaluation was conducted specifically for the Ukrainian-language subcorpora. This involved a grading task where native Ukrainian speakers assessed corrections on a scale from 1 to 5. Interestingly, the human evaluation revealed that the synthetically generated corrections from Reddit-MultiGEC and UberText-GEC generally exhibited better quality than the human-made corrections found in WikiEdits-MultiGEC.
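Comparing subcorpora on a 1-to-5 grading scale comes down to averaging the grades per corpus. The snippet below is a minimal sketch of that comparison; the function name and the input format (pairs of corpus label and grade) are assumptions, and the labels and grades in the usage note are placeholders, not the study's results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_grade_by_corpus(ratings: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Average the 1-5 grades given by annotators, grouped by corpus label."""
    grouped = defaultdict(list)
    for corpus, grade in ratings:
        if not 1 <= grade <= 5:
            raise ValueError(f"grade out of range for {corpus}: {grade}")
        grouped[corpus].append(grade)
    return {corpus: mean(grades) for corpus, grades in grouped.items()}
```

For example, calling `mean_grade_by_corpus([("corpus_a", 4), ("corpus_a", 5), ("corpus_b", 3)])` gives an average of 4.5 for `corpus_a` and 3 for `corpus_b`.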
To demonstrate the practical utility of OmniGEC, the researchers fine-tuned two open-source large language models (LLMs): Aya-Expanse (8B) and Gemma-3 (12B). These models were trained on the multilingual OmniGEC corpora together with the MultiGEC-2025 train set. Both models performed significantly better when trained on the combined datasets. Notably, Gemma-3, the larger of the two models, benefited more from fine-tuning on OmniGEC data and ultimately achieved state-of-the-art results for paragraph-level multilingual GEC.
Looking Ahead
While OmniGEC marks a substantial step forward, the researchers acknowledge certain limitations, including coverage of only eleven languages and human evaluation restricted to Ukrainian. Future work aims to expand language coverage, assess more models, explore sentence-based editing, and run further ablation studies and experiments with preference optimization methods. The OmniGEC dataset collection and the best-performing models are openly available on Hugging Face, supporting continued research and development in multilingual GEC. For more details, refer to the full research paper.


