TLDR: A new benchmark called CL2GEC has been introduced to help Chinese Grammatical Error Correction (CGEC) systems adapt to different academic fields without forgetting what they’ve learned. It includes 10,000 human-annotated sentences from 10 disciplines and evaluates how well models can continually learn. Experiments show that methods which regulate how models update their knowledge are more effective at preventing forgetting than simply replaying old data or naive sequential training.
The field of automated writing assistance is rapidly expanding, especially for complex languages like Chinese. However, existing systems for Chinese Grammatical Error Correction (CGEC) often struggle when faced with the diverse linguistic styles and error patterns found across different academic disciplines. A major challenge is ‘catastrophic forgetting,’ where a model forgets previously learned knowledge when it’s trained on new information.
To address this critical gap, researchers have introduced CL2GEC, the first benchmark specifically designed for Continual Learning in Chinese Literature Grammatical Error Correction. This innovative benchmark aims to evaluate how well CGEC systems can adapt across multiple academic fields over time.
What is CL2GEC?
CL2GEC is a comprehensive dataset featuring 10,000 human-annotated sentences, carefully selected from 10 distinct academic disciplines. Each discipline, ranging from Law and Science to Economics and Literature, presents its own unique linguistic characteristics and common error types. The benchmark is structured to simulate real-world editorial scenarios, where a system might sequentially encounter papers from different fields, requiring it to continually refine its knowledge without losing past learning.
How Was CL2GEC Built?
The dataset was meticulously curated from the China National Knowledge Infrastructure (CNKI), a vast repository of Chinese academic texts. The process involved several stages:
- Data Collection: Sentences were sampled from 10 first-level disciplines, ensuring a balanced representation with 1,000 sentences per discipline.
- Data Cleaning: A rigorous pipeline converted PDFs to structured JSON, filtered out irrelevant sections (like references), segmented text into sentences, removed noise (citations, equations), and anonymized personal information.
- Data Annotation: To ensure high quality and cost-effectiveness, a human-in-the-loop strategy was employed. This involved initial automatic error detection, pre-correction by advanced language models like GPT-4o, independent human annotation by domain-aware graduates, and final expert validation to create a high-precision, multi-reference gold standard.
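The cleaning stage above can be sketched in code. This is a minimal illustration, not the authors' actual pipeline: the regexes, section labels, and function names are assumptions, and the PDF-to-JSON parsing and anonymization steps are omitted.

```python
import re

CITATION = re.compile(r"\[\d+(?:[-,]\d+)*\]")       # e.g. [3], [1-4]
EQUATION = re.compile(r"\$[^$]*\$")                  # inline LaTeX-style math
SKIP_SECTIONS = {"references", "acknowledgements"}   # assumed section labels

def clean_sentence(text: str) -> str:
    """Strip citation markers and inline equations, collapse whitespace."""
    text = CITATION.sub("", text)
    text = EQUATION.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(paragraph: str) -> list[str]:
    """Naive Chinese sentence segmentation on terminal punctuation."""
    parts = re.split(r"(?<=[。！？])", paragraph)
    return [p.strip() for p in parts if p.strip()]

def build_dataset(sections: list[dict]) -> list[str]:
    """sections: [{"title": ..., "text": ...}] parsed from a PDF-to-JSON step."""
    sentences = []
    for sec in sections:
        if sec["title"].lower() in SKIP_SECTIONS:
            continue  # drop non-body material such as reference lists
        for sent in split_sentences(sec["text"]):
            sentences.append(clean_sentence(sent))
    return sentences
```

A real pipeline would also need to handle figure captions, tables, and cross-page sentence breaks, but the shape of the process is the same: filter sections, segment, then denoise each sentence.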
Evaluating Continual Learning
CL2GEC evaluates models in a continual learning setting, meaning models are trained sequentially on discipline-specific datasets. To understand the impact of learning order, experiments were conducted using both a randomized sequence of disciplines and a sequence ordered by semantic similarity.
Performance is measured using both standard Grammatical Error Correction (GEC) metrics (Precision, Recall, F0.5, calculated using the ChERRANT scorer) and specialized Continual Learning metrics:
- Backward Transfer (BWT): Quantifies how well a model retains knowledge of past tasks after learning new ones. A negative value indicates forgetting.
- Average Task Performance (AvgPerf): Measures the model’s overall GEC ability across all disciplines after completing the entire learning sequence.
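Both metrics follow the standard continual-learning definitions and can be computed from a score matrix, where entry R[i][j] is the score on task j after training has finished on task i. A small sketch (assuming each entry is a ChERRANT F0.5 score, as in the benchmark):

```python
def continual_metrics(R):
    """R[i][j]: score on task j after finishing training on task i (0-indexed).

    Returns (BWT, AvgPerf) under the standard continual-learning definitions.
    """
    T = len(R)
    # BWT: average change on each earlier task between the moment it was
    # learned and the end of the sequence; negative means forgetting.
    bwt = sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)
    # AvgPerf: mean final score over all tasks.
    avg = sum(R[T - 1][i] for i in range(T)) / T
    return bwt, avg

# Toy 3-task example: each task's score drifts down as later tasks arrive.
R = [[0.50, 0.00, 0.00],
     [0.45, 0.55, 0.00],
     [0.40, 0.50, 0.60]]
bwt, avg = continual_metrics(R)   # BWT = -0.075, AvgPerf = 0.50
```

In the toy example, task 0 drops from 0.50 to 0.40 and task 1 from 0.55 to 0.50 by the end of training, so BWT is negative, signaling forgetting.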
Key Findings from Experiments
The researchers benchmarked several large language models, including Qwen2.5-7B-Instruct and Llama3-8B-Instruct, using various continual learning strategies:
- Baselines: Naive Sequential Finetuning (SeqFT) and parameter-efficient tuning with LoRA (Low-Rank Adaptation).
- Replay-based Methods: Retaining a small percentage (2%, 5%, or 10%) of data from previous tasks.
- Continual Learning Algorithms: Including Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), Gradient Episodic Memory (GEM), and Orthogonal Gradient Descent (OGD).
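To make the regularization idea concrete, here is a sketch of the EWC penalty in its usual diagonal-Fisher form. The values below are illustrative only; the paper's actual hyperparameters and Fisher estimates are not reproduced here.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta*_i)^2

    theta:      current parameters (flat vector)
    theta_star: parameters frozen after the previous task
    fisher:     diagonal Fisher information; per-parameter importance weights
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

# Parameters the Fisher term marks as important are pulled back harder:
theta      = np.array([1.0, 1.0])
theta_star = np.array([0.0, 0.0])
fisher     = np.array([10.0, 0.1])   # first weight matters to the old task
# During training on a new task, the total loss would be:
#     task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

The penalty makes drifting away from old-task optima expensive in proportion to each parameter's estimated importance, which is exactly the "update regulation" behavior that proved effective in the experiments.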
Here are some significant observations:
- Model Choice Matters: Qwen2.5-7B-Instruct consistently outperformed Llama3-8B-Instruct, likely due to its stronger pretraining on Chinese language data.
- LoRA’s Effectiveness: LoRA, a lightweight adaptation method, significantly improved performance over naive sequential training, establishing itself as a strong baseline.
- Regularization Excels: Regularization-based continual learning methods (like EWC, GEM, LwF, OGD) generally proved more effective at mitigating catastrophic forgetting than simple replay or naive sequential approaches. These methods constrain how parameters are updated, so weights that matter for earlier tasks change less when a new discipline is learned.
- OGD for Overall Performance: Orthogonal Gradient Descent (OGD) achieved the highest average performance, focusing on acquiring new tasks effectively, though sometimes at the cost of slightly lower knowledge retention for older tasks.
- GEM for Knowledge Retention: Gradient Episodic Memory (GEM) demonstrated strong backward transfer, meaning it was particularly good at preserving knowledge across tasks, especially when tasks were semantically related.
- Task Order Impact: The order in which disciplines are learned has a nuanced effect. A semantically similar order generally improved recall and overall average performance but could lead to a decline in precision. Interestingly, Qwen models showed a decline in backward transfer with semantic ordering, while LLaMA models showed improvement, suggesting model-specific sensitivities to task similarity.
- Replay Limitations: Simply increasing the replay buffer size did not consistently improve performance. In some cases, larger buffers even degraded results, highlighting that replay alone is often insufficient and needs to be combined with more adaptive mechanisms.
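GEM's knowledge-retention behavior, noted above, comes from a gradient projection step. A one-constraint numpy sketch follows; the full method solves a small quadratic program over gradients from all past tasks, so this is a simplification of the published algorithm.

```python
import numpy as np

def gem_project(g, g_mem):
    """Project gradient g so it does not increase loss on the memory batch.

    If g conflicts with the memory gradient (negative inner product),
    project it onto the half-space where the inner product is non-negative.
    """
    dot = float(np.dot(g, g_mem))
    if dot >= 0:
        return g                     # no interference: keep the gradient as-is
    return g - (dot / float(np.dot(g_mem, g_mem))) * g_mem

g     = np.array([-1.0, 0.5])        # proposed update for the new task
g_mem = np.array([1.0, 0.0])         # gradient on the episodic-memory buffer
g_new = gem_project(g, g_mem)        # conflict removed: [0.0, 0.5]
```

After projection, the update is orthogonal to the memory gradient rather than opposed to it, which is why GEM preserves earlier tasks well when disciplines share structure.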
Looking Ahead
CL2GEC provides a robust foundation for future research in adaptive grammatical error correction. By simulating real-world domain-incremental learning, it encourages the development of more sophisticated lifelong writing assistants that can generalize effectively across diverse and specialized academic domains, ultimately enhancing automated writing support for Chinese literature.


