OmniGEC: A New Multilingual Dataset Elevates Grammatical Error Correction

TLDR: OmniGEC is a new collection of silver-standard multilingual datasets for Grammatical Error Correction (GEC), covering 11 languages including Czech, English, and Ukrainian. It combines human-edited Wikipedia texts with automatically corrected Reddit and UberText 2.0 data, generated using GPT-4o-mini. Experiments show that fine-tuning large language models like Aya-Expanse and Gemma-3 on OmniGEC significantly improves performance, achieving state-of-the-art results for paragraph-level multilingual GEC, particularly benefiting larger models like Gemma-3.

Grammatical Error Correction (GEC) is a crucial task in Natural Language Processing (NLP) that identifies and rectifies grammatical mistakes in written text. While significant strides have been made in GEC for high-resource languages like English, its development for multilingual contexts, especially for underrepresented languages such as Ukrainian, Czech, and Slovene, has faced challenges due to a lack of high-quality datasets.

Addressing this gap, researchers Roman Kovalchuk, Mariana Romanyshyn, and Petro Ivaniuk have introduced OmniGEC, a comprehensive collection of multilingual silver-standard datasets designed specifically for GEC. This innovative dataset covers eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Slovene, Swedish, and Ukrainian. OmniGEC aims to foster the creation of robust multilingual GEC solutions and bridge the existing data divide.

The Genesis of OmniGEC: Data Sources and Correction Methods

The texts within the OmniGEC datasets originate from three distinct sources. The first, WikiEdits-MultiGEC, comprises human-made corrections derived from Wikipedia edits across the eleven target languages. These edits, specifically from the ‘newcomer task copyedit’ category, represent genuine human efforts to correct small grammatical mistakes.

The other two sources, Reddit-MultiGEC and UberText-GEC, rely on automatically corrected data. Reddit-MultiGEC is a large multilingual corpus of posts scraped from Reddit subreddits, while UberText-GEC is a Ukrainian-only social media corpus. For these datasets, corrections were generated synthetically with the GPT-4o-mini model in three steps: first, GEC instructions and few-shot prompts were produced for each language using DeepL for translation and the o1-preview model; second, GPT-4o-mini generated three possible corrections for each text sample; and finally, GPT-4o-mini aggregated these candidates into a single, more complete correction.
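The generate-then-aggregate steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name correct_paragraph, the prompt wording, and the llm callable (any function that takes a prompt string and returns the model's reply) are all assumptions made for the example. In practice llm would wrap a call to a model such as GPT-4o-mini.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a text reply.
LLMFn = Callable[[str], str]


def correct_paragraph(
    text: str,
    instructions: str,
    llm: LLMFn,
    n_candidates: int = 3,
) -> str:
    """Generate several candidate corrections, then merge them into one."""
    # Step 2: ask the model for n independent corrections of the same text.
    correction_prompt = f"{instructions}\n\nText:\n{text}"
    candidates: List[str] = [llm(correction_prompt) for _ in range(n_candidates)]

    # Step 3: ask the model to aggregate the candidates into a single correction.
    # The prompt wording here is illustrative only.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    aggregate_prompt = (
        f"{instructions}\n\n"
        f"Original text:\n{text}\n\n"
        f"Candidate corrections:\n{numbered}\n\n"
        "Combine the candidates into one corrected version of the original text."
    )
    return llm(aggregate_prompt)
```

With three candidates the model is called four times per text sample: three times to correct and once to aggregate. Step 1 (building the per-language instructions and few-shot prompts) is represented here only by the instructions argument.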

Evaluating Quality and Performance

The quality of corrections within the OmniGEC datasets underwent rigorous evaluation, combining both automated metrics and human feedback. Due to practical constraints, human evaluation was conducted specifically for the Ukrainian-language subcorpora. This involved a grading task where native Ukrainian speakers assessed corrections on a scale from 1 to 5. Interestingly, the human evaluation revealed that the synthetically generated corrections from Reddit-MultiGEC and UberText-GEC generally exhibited better quality than the human-made corrections found in WikiEdits-MultiGEC.
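As a rough sketch of how grades from such a task might be summarized, the snippet below averages 1-to-5 grades per subcorpus. The function name and the input format (pairs of subcorpus name and grade) are assumptions for illustration; the article does not describe how the authors aggregated their annotations.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_grade_by_subcorpus(grades: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Average human grades (1 to 5) for each subcorpus."""
    by_corpus: Dict[str, List[int]] = defaultdict(list)
    for subcorpus, grade in grades:
        # Reject anything outside the 1-to-5 grading scale.
        if not 1 <= grade <= 5:
            raise ValueError(f"grade out of range: {grade}")
        by_corpus[subcorpus].append(grade)
    return {name: mean(values) for name, values in by_corpus.items()}


# Placeholder names and grades, not figures from the paper.
example = [("corpus_a", 4), ("corpus_a", 2), ("corpus_b", 5)]
print(mean_grade_by_subcorpus(example))
```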

To demonstrate the practical utility of OmniGEC, the researchers fine-tuned two open-source large language models (LLMs): Aya-Expanse (8B) and Gemma-3 (12B). These models were trained on the multilingual OmniGEC corpora together with the MultiGEC-2025 train set. Both models performed better when trained on the combined datasets. Gemma-3, the larger of the two, benefited more from fine-tuning on OmniGEC data and ultimately achieved state-of-the-art results for paragraph-level multilingual GEC.

Looking Ahead

While OmniGEC marks a substantial step forward, the researchers acknowledge certain limitations, including coverage of only eleven languages and human evaluation restricted to Ukrainian. Future work aims to expand language coverage, assess more models, explore sentence-based editing, and run further ablation studies and experiments with preference optimization methods. The OmniGEC dataset collection and the best-performing models are openly available on Hugging Face, supporting continued research and development in multilingual GEC. For more details, refer to the full research paper.
