TLDR: A new research paper introduces Pattern-aware Line-level Deduplication (PLD) and Pattern-aware Trailing Punctuation Filtering (PTF) to improve the quality of pretraining corpora for Large Language Models (LLMs). These methods go beyond traditional line-level filtering by considering the sequential distribution of text-quality signals across documents. By retaining structurally important content often discarded by conventional techniques, PLD and PTF consistently enhance LLM performance on multiple-choice benchmarks and significantly boost generative question-answering accuracy in both English and Korean.
Large Language Models (LLMs) have transformed the landscape of natural language processing, but their impressive capabilities heavily rely on the quality of their training data. Much of this data comes from vast web archives like CommonCrawl. To make this raw web data useful, various filtering techniques are applied to remove irrelevant or low-quality text. However, a recent study highlights a critical flaw in traditional filtering methods: they often discard valuable content, inadvertently hindering the performance of LLMs.
Researchers Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, and Jaejin Lee have introduced a novel approach to data filtering in their paper “Beyond Line-Level Filtering for the Pretraining Corpora of LLMs”, available at arXiv:2510.24139. The paper proposes two enhanced methods: Pattern-aware Line-level Deduplication (PLD) and Pattern-aware Trailing Punctuation Filtering (PTF).
The Problem with Traditional Filtering
Common filtering techniques, such as simply removing duplicate lines or lines without trailing punctuation, are widely used. While these methods aim to clean data and reduce redundancy, the researchers found that they can be too aggressive. For instance, structural elements like section headers or repeated phrases in a document might be crucial for understanding but are often flagged as boilerplate or uninformative by these basic rules. This can lead to a loss of context and negatively impact how well an LLM performs on tasks like question answering.
Pattern-Aware Line-Level Deduplication (PLD)
PLD refines the concept of deduplication by not just looking at individual lines, but also considering their frequency across many documents and their sequence within a single document. Each line is categorized into one of three groups based on how often it appears in a large dataset:
- Red: Highly repetitive lines (e.g., appearing over 1,000 times for English or 50 times for Korean). These are often boilerplate.
- Yellow: Undecidable lines (e.g., appearing more than once for English or more than three times for Korean).
- Green: Distinctive lines (all others). These are likely unique and informative.
Instead of discarding all ‘Red’ or ‘Yellow’ lines, PLD looks for patterns. It retains sequences of lines that show structural importance, such as two or more consecutive ‘Green’ lines, or ‘Yellow’ and ‘Red’ lines embedded within ‘Green’ sections. This ensures that important structural cues, which might be repetitive but contextually significant, are preserved.
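The retention logic described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the thresholds use the English values quoted earlier, and the exact treatment of embedded lines (here, "promoting" a sandwiched ‘Yellow’/‘Red’ line before keeping ‘Green’ runs of length two or more) is an assumption about how the rules compose:

```python
from collections import Counter

# Illustrative thresholds (English values from the article; Korean differs)
RED_THRESHOLD = 1000   # lines appearing more often are 'red' (boilerplate)
YELLOW_THRESHOLD = 1   # lines appearing more than once are 'yellow'

def categorize(line, freq):
    """Map a line to red/yellow/green by its corpus-wide frequency."""
    n = freq[line]
    if n > RED_THRESHOLD:
        return "red"
    if n > YELLOW_THRESHOLD:
        return "yellow"
    return "green"

def pld_filter(doc_lines, freq):
    """Keep structurally meaningful patterns: green runs of length >= 2,
    plus yellow/red lines enclosed by green lines on both sides."""
    colors = [categorize(line, freq) for line in doc_lines]
    # Promote a non-green line flanked by green lines: it is treated as
    # part of the surrounding green section.
    promoted = list(colors)
    for i in range(1, len(colors) - 1):
        if colors[i] != "green" and colors[i - 1] == "green" and colors[i + 1] == "green":
            promoted[i] = "green"
    # Keep only green lines that belong to a run of two or more; isolated
    # green lines are dropped (consistent with PLD removing more tokens).
    keep = []
    for i, c in enumerate(promoted):
        neighbor_green = (i > 0 and promoted[i - 1] == "green") or \
                         (i + 1 < len(promoted) and promoted[i + 1] == "green")
        keep.append(c == "green" and neighbor_green)
    return [line for line, k in zip(doc_lines, keep) if k]
```

For example, a repeated section header (‘Yellow’) sandwiched between two unique paragraphs survives, while a lone unique line with no ‘Green’ neighbors is discarded.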
Pattern-Aware Trailing Punctuation Filtering (PTF)
Similarly, PTF enhances the traditional trailing punctuation filter. While lines ending without punctuation are often seen as incomplete or low-quality, the researchers observed that these lines can serve as vital structural indicators, especially when surrounded by complete sentences. PTF categorizes lines as ‘Green’ (with trailing punctuation) or ‘Red’ (without trailing punctuation).
The filter then retains sequences where non-punctuated (‘Red’) lines are enclosed by lines that do end with punctuation (‘Green’). This allows for the retention of short headers or list items that might otherwise be discarded, recognizing their role within the document’s overall structure.
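A sketch of this enclosure rule follows. The set of trailing punctuation marks is an assumption for illustration; the paper's actual character set and run handling may differ:

```python
# Assumed punctuation set; the paper's exact set may differ.
TRAILING_PUNCT = ('.', '!', '?', '"', "'", ')')

def ptf_filter(doc_lines):
    """Keep punctuated ('green') lines, plus runs of non-punctuated
    ('red') lines that are enclosed by green lines on both sides."""
    colors = ["green" if line.rstrip().endswith(TRAILING_PUNCT) else "red"
              for line in doc_lines]
    keep = [c == "green" for c in colors]
    i = 0
    while i < len(colors):
        if colors[i] == "red":
            # Find the extent of this red run.
            j = i
            while j < len(colors) and colors[j] == "red":
                j += 1
            # Retain the run only if green lines enclose it on both sides.
            if i > 0 and j < len(colors):
                for k in range(i, j):
                    keep[k] = True
            i = j
        else:
            i += 1
    return [line for line, k in zip(doc_lines, keep) if k]
```

Under this rule, a short header between two complete sentences is kept, while unpunctuated lines at the start or end of a document are still filtered out.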
Evaluation and Impact
To test their methods, the researchers trained small language models (around 1 billion parameters) using both English and Korean datasets. They compared the performance of models trained with traditional filtering against those trained with PLD and PTF on various downstream tasks, including multiple-choice benchmarks and generative question-answering tasks like SQuAD v1 and KorQuAD v1.
The results were compelling: the pattern-aware filtering methods consistently improved performance. Notably, they significantly enhanced generative question-answering accuracy, a task where traditional filters often caused a decline. The study also highlighted that filtering decisions should be customized for different languages, as the impact of certain rules can vary between English and Korean.
Interestingly, PLD removed more tokens than traditional deduplication, as it also discards isolated distinctive lines deemed irrelevant. Conversely, PTF removed fewer tokens than its traditional counterpart, as it strategically retains more contextually important lines without punctuation.
Conclusion
This research underscores the importance of a more nuanced approach to filtering pretraining corpora for LLMs. By considering the sequential distribution of line-level signals rather than treating each line in isolation, pattern-aware filtering techniques can retain structurally important content that traditional methods might mistakenly discard. This leads to better-trained language models that perform more effectively across a range of downstream tasks, particularly in generative question answering, and demonstrates the value of language-specific tuning in data curation.