Large Language Models Bring Context to Text Preprocessing

TLDR: This research investigates how Large Language Models (LLMs) can improve text preprocessing tasks like stopword removal, lemmatization, and stemming by considering context, unlike traditional methods. The study found that LLMs can replicate traditional techniques with high accuracy (up to 97% for stopword removal) and significantly boost the performance of downstream machine learning classification tasks by up to 6% in F1 measure across six European languages. While LLMs excel in lemmatization and stopword removal, their performance in stemming is less consistent. The findings suggest LLMs are a promising tool for more context-aware text preparation, especially for languages with limited resources.

Text preprocessing is a foundational step in Natural Language Processing (NLP) that prepares raw text for further analysis. Techniques like removing common words (stopwords), reducing words to their base form (stemming), and converting words to their dictionary form (lemmatization) are crucial for standardizing text and reducing computational load. However, traditional methods for these tasks often operate without considering the specific context of the text, leading to potential inaccuracies.

A recent research paper, “Investigating Large Language Models’ Linguistic Abilities for Text Preprocessing,” explores a novel approach: leveraging Large Language Models (LLMs) to perform these preprocessing tasks. The core idea is that LLMs, with their advanced understanding of linguistic context, can dynamically adapt preprocessing based on the input document, its context, and the specific task at hand, without needing extensive language-specific annotated resources.

The researchers, Marco Braga, Gian Carlo Milanese, and Gabriella Pasi, conducted a comprehensive evaluation comparing LLM-based preprocessing against traditional algorithms across multiple text classification tasks in six European languages: English, French, German, Italian, Portuguese, and Spanish. They utilized several state-of-the-art LLMs, including Gemma-2, Gemma-3, Llama-3.1, Phi-4, and Qwen-2.5, instructing them through carefully designed prompts that included a description of the task, a few examples, the text to be processed, its language, and the context of the downstream task.
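The prompt structure described above can be sketched as a small template function. This is only an illustration: the function name, field labels, and wording are assumptions made for this example, not the authors' actual prompts, which are in their repository.

```python
def build_preprocessing_prompt(task_description, examples, text, language, downstream_task):
    """Assemble a few-shot prompt from the five ingredients the article lists.

    The layout and labels are hypothetical; the study's real prompts may differ.
    """
    # Render each few-shot example as an input/output pair.
    shots = "\n".join(f"Input: {src}\nOutput: {dst}" for src, dst in examples)
    return (
        f"Task: {task_description}\n"
        f"Language: {language}\n"
        f"Downstream task: {downstream_task}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Input: {text}\nOutput:"
    )


# Toy usage with made-up example sentences.
prompt = build_preprocessing_prompt(
    task_description="Lemmatize every word in the input text.",
    examples=[("The cats were sleeping", "the cat be sleep")],
    text="She was reading the books",
    language="English",
    downstream_task="sentiment analysis",
)
print(prompt)
```

Keeping the language and the downstream task as explicit fields is what lets the same template serve all six languages and all classification tasks.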

LLMs’ Preprocessing Prowess

The study first assessed how closely LLMs could replicate traditional preprocessing techniques, using the output of existing tools as the reference. LLMs reached accuracies of up to 97% for stopword removal, 82% for lemmatization, and 74% for stemming. Notably, Gemma-2 consistently performed well in English on all three tasks. For non-English languages, the results varied: Qwen-2.5 showed strong lemmatization in French, Italian, and Spanish, while Gemma-3 and Phi-4 excelled in German and Portuguese lemmatization, respectively.
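One plausible way to compute this kind of accuracy is token-level agreement with a reference tool. The sketch below is an assumption about the metric, not the paper's exact evaluation procedure, and the function names and sample tokens are invented for illustration.

```python
def token_agreement(reference, candidate):
    """Fraction of aligned positions where the LLM output matches the reference.

    Assumes both outputs contain one lemma (or stem) per input token.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must be aligned token by token")
    if not reference:
        return 1.0
    return sum(r == c for r, c in zip(reference, candidate)) / len(reference)


def stopword_agreement(tokens, reference_removed, candidate_removed):
    """Fraction of tokens on which both methods make the same keep/remove decision."""
    if not tokens:
        return 1.0
    agree = sum((t in reference_removed) == (t in candidate_removed) for t in tokens)
    return agree / len(tokens)


# Toy data: the LLM additionally removes "user", so it disagrees on 1 of 6 tokens.
tokens = ["the", "user", "is", "posting", "a", "photo"]
stopword_score = stopword_agreement(
    tokens,
    reference_removed={"the", "is", "a"},
    candidate_removed={"the", "user", "is", "a"},
)

# Toy data: the LLM leaves "posting" unlemmatized, so 5 of 6 lemmas match.
lemma_score = token_agreement(
    ["the", "user", "be", "post", "a", "photo"],
    ["the", "user", "be", "posting", "a", "photo"],
)
print(stopword_score, lemma_score)
```

Under a metric like this, a disagreement is not necessarily an LLM error: it only means the LLM made a different choice than the reference tool.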

An interesting observation was that LLMs sometimes removed words not traditionally considered stopwords (e.g., ‘user’ in social media text), suggesting their contextual understanding influenced their decisions. This highlights a key advantage: LLMs can make more nuanced choices based on the text’s domain or specific characteristics.

Impact on Downstream Tasks

Beyond replicating traditional methods, the research investigated whether LLM-based preprocessing improved the performance of downstream tasks. Machine learning algorithms (Decision Tree, Logistic Regression, and Naive Bayes) were trained on texts preprocessed by both traditional and LLM-based methods for various text classification tasks, including emoji prediction, irony detection, hate detection, offensive language identification, sentiment analysis, and news classification.

The results showed a significant improvement: ML algorithms trained on texts preprocessed by LLMs achieved an F1 measure improvement of up to 6% compared to traditional techniques. In English, LLMs outperformed traditional methods in 25 out of 35 examined combinations of datasets and preprocessing tasks. This was particularly evident when stopword removal was combined with lemmatization, indicating LLMs’ ability to dynamically identify task-relevant stopwords and lemmas in a more context-sensitive manner.
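For readers unfamiliar with the metric, here is a minimal pure-Python sketch of an F1 comparison between two preprocessing pipelines. It assumes macro averaging, which the article does not specify, and the labels and predictions are toy values, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy gold labels and predictions from two hypothetical pipelines.
gold = ["pos", "pos", "neg", "neg"]
pred_traditional = ["pos", "neg", "neg", "neg"]
pred_llm = ["pos", "pos", "neg", "neg"]

f1_traditional = macro_f1(gold, pred_traditional)
f1_llm = macro_f1(gold, pred_llm)
print(f"difference in macro F1: {f1_llm - f1_traditional:.3f}")
```

Because the classifier and the data stay fixed while only the preprocessing changes, a difference in F1 can be attributed to the preprocessing step.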

However, LLM-based stemming did not consistently outperform traditional stemming, especially in English. This might be because stemming is less context-dependent, and LLMs sometimes showed inconsistencies in generating stems across different texts, which can negatively impact standardized text representations.
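The inconsistency problem can be made concrete with a small check that flags words mapped to more than one stem across texts. This is an illustrative sketch: the function name and the sample pairs are hypothetical and not taken from the study.

```python
from collections import defaultdict


def inconsistent_stems(pairs):
    """Return words that received more than one distinct stem.

    `pairs` is an iterable of (word, stem) tuples collected across documents.
    """
    stems_by_word = defaultdict(set)
    for word, stem in pairs:
        stems_by_word[word.lower()].add(stem.lower())
    # Keep only words whose stem set has more than one member.
    return {w: sorted(s) for w, s in stems_by_word.items() if len(s) > 1}


# Toy output of an LLM stemmer run on two different texts.
pairs = [
    ("running", "run"),
    ("running", "runn"),
    ("studies", "studi"),
    ("Studies", "studi"),
]
conflicts = inconsistent_stems(pairs)
print(conflicts)
```

A rule-based stemmer is deterministic, so it would produce no conflicts here. Each conflict splits one word into several vocabulary entries, which weakens the standardization that stemming is meant to provide.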

Future Directions and Limitations

While promising, the study acknowledges limitations. The evaluation compared LLM outputs against existing Python libraries, which might not fully capture instances where LLMs offer superior contextual understanding (e.g., correctly lemmatizing a complex hashtag). The computational cost of using LLMs is also significantly higher than traditional methods. Nevertheless, the researchers suggest that LLM-based preprocessing is particularly justified for low-resource languages that lack extensive annotated resources for developing traditional lemmatizers and stemmers.

The paper concludes that LLMs offer a powerful new avenue for text preprocessing, especially for lemmatization and context-aware stopword removal across multiple languages. Future work will explore their potential for low-resource languages, where the ability to operate without extensive pre-annotated data could be especially valuable. The authors have made their code and prompts available in a public repository, and the full research paper is available online.
