Large Language Models Bring Context to Text Preprocessing

TLDR: This research investigates how Large Language Models (LLMs) can improve text preprocessing tasks such as stopword removal, lemmatization, and stemming by taking context into account, which traditional methods do not. The study found that LLMs can replicate traditional techniques with high accuracy (up to 97% for stopword removal) and boost the performance of downstream machine learning classification tasks by up to 6% in F1 measure across six European languages. While LLMs excel in lemmatization and stopword removal, their performance in stemming is less consistent. The findings suggest LLMs are a promising tool for more context-aware text preparation, especially for languages with limited resources.

Text preprocessing is a foundational step in Natural Language Processing (NLP) that prepares raw text for further analysis. Techniques like removing common words (stopwords), reducing words to a root form by stripping affixes (stemming), and converting words to their dictionary form (lemmatization) are crucial for standardizing text and reducing computational load. However, traditional methods for these tasks typically rely on fixed word lists and rules that ignore the specific context of the text, which can lead to inaccuracies.
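To make the three techniques concrete, here is a minimal sketch of context-free preprocessing in plain Python. The stopword list, suffix rules, and lemma dictionary are tiny made-up stand-ins for illustration, not the resources used in the paper or in any real library.

```python
import re

# Toy resources for illustration only; real pipelines use curated lists and rules.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order
LEMMAS = {"was": "be", "is": "be", "are": "be", "mice": "mouse", "running": "run"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop every token found in the fixed list, whatever the context."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Look the token up in a dictionary; unknown words pass through."""
    return LEMMAS.get(token, token)


tokens = tokenize("The mice are running in the garden")
content = remove_stopwords(tokens)
stems = [stem(t) for t in content]
lemmas = [lemmatize(t) for t in content]

print(content)  # ['mice', 'running', 'garden']
print(stems)    # ['mice', 'runn', 'garden']
print(lemmas)   # ['mouse', 'run', 'garden']
```

None of these functions looks at the surrounding words or at the task, which is the limitation the article describes. The crude stem "runn" also shows how rule-based stripping can differ from the dictionary form "run".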

A recent research paper, “Investigating Large Language Models’ Linguistic Abilities for Text Preprocessing,” explores a novel approach: leveraging Large Language Models (LLMs) to perform these preprocessing tasks. The core idea is that LLMs, with their advanced understanding of linguistic context, can dynamically adapt preprocessing based on the input document, its context, and the specific task at hand, without needing extensive language-specific annotated resources.

The researchers, Marco Braga, Gian Carlo Milanese, and Gabriella Pasi, conducted a comprehensive evaluation comparing LLM-based preprocessing against traditional algorithms across multiple text classification tasks in six European languages: English, French, German, Italian, Portuguese, and Spanish. They utilized several state-of-the-art LLMs, including Gemma-2, Gemma-3, Llama-3.1, Phi-4, and Qwen-2.5, instructing them through carefully designed prompts that included a description of the task, a few examples, the text to be processed, its language, and the context of the downstream task.
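The prompt ingredients listed above (task description, a few examples, the text, its language, and the downstream task) could be assembled along these lines. This is only a sketch: the function name `build_prompt`, the field labels, and the example sentences are hypothetical and are not the authors' actual prompts.

```python
def build_prompt(task, text, language, downstream_task, examples):
    """Assemble a few-shot preprocessing prompt from its ingredients.

    The layout and wording here are assumptions made for illustration.
    """
    lines = [
        f"Task: {task}",
        f"Language: {language}",
        f"Downstream task: {downstream_task}",
        "",
        "Examples:",
    ]
    # Each example is an (input text, expected output) pair.
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # End with the text to process, leaving the output for the model to fill in.
    lines += ["", f"Input: {text}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt(
    task="Replace every word with its lemma.",
    text="Users are posting funny photos",
    language="English",
    downstream_task="sentiment analysis",
    examples=[
        ("The cats were sleeping", "the cat be sleep"),
        ("She has two children", "she have two child"),
    ],
)
print(prompt)
```

The resulting string would then be sent to whichever model is being evaluated; that call is omitted here because it depends on the serving setup.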

LLMs’ Preprocessing Prowess

The study first assessed how effectively LLMs could replicate traditional preprocessing techniques. The findings were impressive: LLMs were capable of performing stopword removal with accuracies reaching 97%, lemmatization up to 82%, and stemming up to 74%. Notably, Gemma-2 consistently performed well in English for stopword removal, lemmatization, and stemming. For non-English languages, the results varied, with Qwen-2.5 showing strong lemmatization in French, Italian, and Spanish, while Gemma-3 and Phi-4 excelled in German and Portuguese lemmatization, respectively.

An interesting observation was that LLMs sometimes removed words not traditionally considered stopwords (e.g., ‘user’ in social media text), suggesting their contextual understanding influenced their decisions. This highlights a key advantage: LLMs can make more nuanced choices based on the text’s domain or specific characteristics.
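One simple way to quantify how closely an LLM replicates a traditional tool is a token-level agreement score. The sketch below is an assumption about how such a comparison could be set up; the paper's exact accuracy definition may differ, and the token lists are invented.

```python
def stopword_agreement(tokens, reference_kept, llm_kept):
    """Share of tokens on which both methods make the same keep/drop decision.

    Decisions are compared per word type, so repeated tokens are treated alike.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    reference, llm = set(reference_kept), set(llm_kept)
    agreements = sum((t in reference) == (t in llm) for t in tokens)
    return agreements / len(tokens)


def lemma_accuracy(reference_lemmas, llm_lemmas):
    """Share of positions where the LLM lemma equals the reference lemma."""
    if not reference_lemmas or len(reference_lemmas) != len(llm_lemmas):
        raise ValueError("lemma lists must be non-empty and of equal length")
    matches = sum(r == l for r, l in zip(reference_lemmas, llm_lemmas))
    return matches / len(reference_lemmas)


tokens = ["the", "user", "posted", "a", "great", "photo"]
reference_kept = ["user", "posted", "great", "photo"]  # fixed stopword list
llm_kept = ["posted", "great", "photo"]                # LLM also drops 'user'

score = stopword_agreement(tokens, reference_kept, llm_kept)
print(round(score, 2))  # 0.83
```

Note that a score like this counts the dropped 'user' as a disagreement even when dropping it is the better choice for the task, so agreement with a traditional tool is not the same as quality.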

Impact on Downstream Tasks

Beyond replicating traditional methods, the research investigated whether LLM-based preprocessing improved the performance of downstream tasks. Machine learning algorithms (Decision Tree, Logistic Regression, and Naive Bayes) were trained on texts preprocessed by both traditional and LLM-based methods for various text classification tasks, including emoji prediction, irony detection, hate detection, offensive language identification, sentiment analysis, and news classification.

The results favored LLM-based preprocessing: ML algorithms trained on texts preprocessed by LLMs improved their F1 measure by up to 6% compared to those trained on traditionally preprocessed texts. In English, LLMs outperformed traditional methods in 25 out of 35 examined combinations of datasets and preprocessing tasks. This was particularly evident when stopword removal was combined with lemmatization, indicating LLMs' ability to dynamically identify task-relevant stopwords and lemmas in a more context-sensitive manner.
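For readers unfamiliar with the F1 measure, the sketch below computes it for two sets of predictions and compares them. The labels and predictions are made-up toy values, and macro averaging is an assumption, since the article does not say which averaging scheme the study used.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Toy gold labels and predictions from two hypothetical classifiers.
y_true = ["pos", "neg", "pos", "neg", "pos", "neg"]
pred_traditional = ["pos", "pos", "pos", "neg", "neg", "neg"]
pred_llm = ["pos", "neg", "pos", "neg", "neg", "neg"]

f1_traditional = macro_f1(y_true, pred_traditional)
f1_llm = macro_f1(y_true, pred_llm)
print(f"traditional: {f1_traditional:.3f}, LLM: {f1_llm:.3f}")
```

In the study's setup, the two prediction lists would come from the same classifier trained once on traditionally preprocessed text and once on LLM-preprocessed text.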

However, LLM-based stemming did not consistently outperform traditional stemming, especially in English. This might be because stemming is less context-dependent, and LLMs sometimes showed inconsistencies in generating stems across different texts, which can negatively impact standardized text representations.
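A small sketch shows why inconsistent stems hurt a standardized representation. The stemmed documents below are invented for illustration and are not outputs reported in the paper.

```python
def vocabulary(docs):
    """Collect the distinct features across all tokenized documents."""
    return {token for doc in docs for token in doc}


def document_frequency(docs, feature):
    """Count how many documents contain the given feature."""
    return sum(feature in doc for doc in docs)


# The same three documents, stemmed consistently vs. inconsistently (invented).
consistent = [["connect", "network"], ["connect", "user"], ["connect", "server"]]
inconsistent = [["connect", "network"], ["connec", "user"], ["connection", "server"]]

print(len(vocabulary(consistent)), document_frequency(consistent, "connect"))      # 4 3
print(len(vocabulary(inconsistent)), document_frequency(inconsistent, "connect"))  # 6 1
```

When one underlying word is split across several stems, the vocabulary grows and each variant appears in fewer documents, so a bag-of-words classifier sees weaker evidence for the same concept.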

Future Directions and Limitations

While promising, the study acknowledges limitations. The evaluation compared LLM outputs against existing Python libraries, which might not fully capture instances where LLMs offer superior contextual understanding (e.g., correctly lemmatizing a complex hashtag). The computational cost of using LLMs is also significantly higher than traditional methods. Nevertheless, the researchers suggest that LLM-based preprocessing is particularly justified for low-resource languages that lack extensive annotated resources for developing traditional lemmatizers and stemmers.

The paper concludes that LLMs offer a powerful new avenue for text preprocessing, especially for lemmatization and context-aware stopword removal across multiple languages. Future work will explore their potential for low-resource languages, where their ability to operate without extensive pre-annotated data could be a game-changer. The researchers have released their code and prompts in a publicly available repository, and the full research paper is available online.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
