TL;DR: This research evaluates how Large Language Models (LLMs) such as Gemini, GPT-4o, DeepSeek, and Groq perform at transforming abusive social media text into non-abusive versions while preserving the original meaning and sentiment. The study found that while all of the LLMs effectively reduce abusive content, Groq tends to rephrase text heavily with positive language, sometimes altering the original context. In contrast, GPT-4o and DeepSeek stay semantically closer to the original text. The paper highlights the potential of LLMs for online content moderation and discusses factors such as API safety settings that influence their performance, along with limitations and future research directions.
In today’s digital age, social media platforms have become central to communication, allowing users to freely express themselves. However, this freedom often comes with the challenge of managing harmful content, including cyberbullying, harassment, and hate speech. Traditional methods for detecting and moderating such abusive text have struggled with the complexity and nuances of human language. This has led researchers to explore the capabilities of Large Language Models (LLMs) in transforming abusive content into non-abusive versions while preserving the original message and sentiment.
A recent study, titled “Abusive text transformation using LLMs,” delves into this critical area, evaluating the performance of several state-of-the-art LLMs: Gemini, GPT-4o, DeepSeek, and Groq. The core objective was to assess their ability to identify abusive text, particularly from tweets and reviews containing hate speech and swear words, and then transform them into clean, appropriate content without losing the original intent or sentiment. The researchers utilized a comprehensive framework involving data acquisition, preprocessing, LLM API configuration, review transformation, and rigorous sentiment and semantic analysis.
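To make the pipeline concrete, here is a minimal sketch of what the review-transformation step could look like, using the OpenAI Python SDK with GPT-4o (one of the evaluated models). The prompt wording, function name, and example input are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of the transformation step: ask an LLM to rewrite
# abusive text while preserving its meaning and sentiment. The prompt
# below is illustrative, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Rewrite the user's text so it contains no abusive, hateful, or "
    "profane language, while keeping the original meaning and sentiment."
)

def transform(text: str) -> str:
    """Return a non-abusive rewrite of `text`."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep rewrites as reproducible as possible
    )
    return response.choices[0].message.content

print(transform("This movie is absolute garbage and the director is an idiot."))
```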
The methodology used two datasets of abusive text: one from the Indian Institute of Technology Guwahati (IITG) and another consisting of Twitter (X) data. These datasets contained various forms of abuse, including hate speech, swear words, personal attacks, and racist remarks. To evaluate the transformations, BERT-based models were employed: HateBERT for abuse detection and SenWave-BERT for sentiment analysis. Semantic analysis was performed using MPNet to measure how semantically similar each transformed text remained to its original.
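As an illustration of the semantic-analysis step, the sketch below embeds an original and a transformed text with a public MPNet sentence encoder and compares them by cosine similarity. The checkpoint name and example texts are assumptions; the exact MPNet variant the authors used may differ.

```python
# Hypothetical sketch of the semantic-similarity check: embed both texts
# with an MPNet sentence encoder and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

original = "This movie is absolute garbage and the director is an idiot."
transformed = "This movie fell far short of what I expected from the director."

emb = model.encode([original, transformed], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()  # 1.0 = identical meaning
print(f"semantic similarity: {similarity:.3f}")
```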
Key Findings from the LLM Evaluation
The study revealed distinct characteristics among the LLMs in their approach to text transformation. Groq, for instance, consistently produced the longest outputs, often adding positive phrases and encouraging respectful dialogue. This tendency sometimes altered the original context significantly, making Groq's transformed text less semantically similar to the original than the other models' outputs. It was also reflected in Groq having the lowest successful transformation rate among the tested models, likely because its stricter internal guidelines could not be easily adjusted by the researchers.
In contrast, GPT-4o and DeepSeek produced remarkably similar results. Both models tended to preserve more of the original context and phrasing, resulting in higher semantic similarity to the input text, and both were highly successful at transforming abusive content, with GPT-4o achieving the highest success rate overall. Gemini struck a balance, offering polite rewrites without as much added commentary as Groq, and it achieved a higher transformation success rate than Groq, partly because its configurable API safety settings allowed the researchers to relax content moderation.
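For context on what "configurable safety settings" means in practice, here is a hedged sketch using the public google-generativeai SDK to lower Gemini's blocking thresholds so that abusive input is not rejected before it can be rewritten. The model name, prompt, and threshold choices are assumptions, not the authors' documented configuration.

```python
# Hypothetical sketch: relax Gemini's safety thresholds so abusive input
# text is not blocked before the model can transform it. Category and
# threshold names come from the public google-generativeai SDK.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel(
    "gemini-1.5-flash",  # assumed model; the paper's choice may differ
    safety_settings=[
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    ],
)

response = model.generate_content(
    "Rewrite this tweet without abusive language, keeping its meaning: ..."
)
print(response.text)
```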
Across all models, the transformation process significantly reduced the presence of hateful words, as confirmed by keyword searches. Sentiment analysis showed a dramatic shift from negative sentiments like “annoyed” in the original texts to more “optimistic” sentiments in the transformed versions. Groq, in particular, showed the most substantial increase in optimistic transformations. While all models were effective in detoxifying text, the study highlighted that Groq occasionally altered the semantic meaning through excessive positive phrasing, unlike the other models.
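A keyword search of this kind can be as simple as counting lexicon hits before and after transformation. The toy lexicon and texts below are placeholders; the paper's actual word list is not reproduced here.

```python
# Hypothetical sketch of the keyword-based check: count occurrences of a
# small lexicon of abusive terms in the original and transformed texts.
import re

HATE_LEXICON = {"idiot", "stupid", "trash"}  # illustrative placeholder terms

def count_hate_words(text: str) -> int:
    """Count whole-word lexicon hits in `text`, case-insensitively."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in HATE_LEXICON for token in tokens)

original = "You are a stupid idiot and your take is trash."
transformed = "I strongly disagree with your take."
print(count_hate_words(original), "->", count_hate_words(transformed))  # 3 -> 0
```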
Challenges and Future Directions
Despite the promising results, the researchers acknowledged several limitations. The datasets often contained incomplete sentences, slang, and emojis, which made it harder for the LLMs to grasp context accurately. Sarcasm, a common element of social media, was another significant factor the study did not fully address. Furthermore, the study focused solely on text, omitting multimodal inputs such as images and links that are integral to the user experience on platforms like Twitter. API limitations also restricted the analysis to a subset of tweets for each model.
Future work could involve evaluating other LLMs such as Claude and Mistral, using more diverse datasets (e.g., hate speech in different languages), and developing models that better understand sarcasm, emoji meanings, and slang. Exploring multimodal inputs to create context-aware models would also provide a deeper understanding of user behavior, helping platforms balance online safety with freedom of expression. This research underscores the growing potential of LLMs to foster healthier digital environments by effectively transforming abusive content. For more details, refer to the full research paper.


