TLDR: This paper investigates how traditional data augmentation methods like backtranslation and paraphrasing, when powered by modern Large Language Models (LLMs) such as GPT, compare to purely generative methods for improving emotion classification. Using the GoEmotions dataset, the study found that all augmentation techniques significantly boost classification performance, especially for underrepresented emotion categories. Backtranslation, particularly with DeepL, yielded the most substantial improvements in F1-macro scores for augmented classes, demonstrating that leveraging LLMs with traditional methods can be highly effective in addressing data scarcity and class imbalance in NLP tasks.
Data scarcity and class imbalance remain persistent challenges in Natural Language Processing (NLP). Many specialized machine learning tasks perform poorly without large, diverse datasets, which leaves models either under-trained or biased towards the most frequent classes.
Understanding the Challenge: Data Scarcity in NLP
Deep learning models, the backbone of modern AI, thrive on large volumes of high-quality input data. For NLP, this means extensive textual data. However, numerous domain-specific applications lack such abundant resources. Data augmentation (DA) emerges as a crucial technique to tackle this problem. While DA has seen significant success in computer vision and audio processing, its application to text data is more complex due to the nuanced nature of language. The advent of Large Language Models (LLMs) like GPT has opened new avenues for text augmentation, offering powerful generative capabilities to create diverse and coherent text samples.
The Study’s Approach: Comparing Data Augmentation Methods
A recent research paper, titled “Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification,” systematically compares data augmentation methods for NLP. Its core objective was to assess whether traditional techniques, specifically paraphrasing and backtranslation, can be enhanced by the new generation of LLMs to match or even surpass purely generative methods. The researchers compared four distinct approaches to data augmentation across multiple setups, evaluating both the quality of the generated data and its impact on classification performance. The key finding is that backtranslation and paraphrasing, when powered by LLMs, can yield results comparable or even superior to zero-shot and few-shot generation of examples.
The GoEmotions Dataset: A Case Study
For their experiments, the researchers chose the GoEmotions dataset, an extensive collection of textual comments labeled with specific emotions. This dataset, collaboratively developed by Google and Amazon, is derived from Reddit comments posted between 2005 and 2019 and is human-annotated. Despite its size and quality, GoEmotions presents a notable challenge: class imbalance. It features 27 emotion labels, but some categories are severely underrepresented. The study focused on augmenting the five least represented emotions: embarrassment, nervousness, relief, pride, and grief. This imbalance makes GoEmotions a natural test bed for evaluating data augmentation techniques.
The Data Augmentation Techniques Explored
The study investigated four primary data augmentation techniques, focusing on novel approaches that utilize LLMs:
1. Oversampling: The Baseline
Oversampling is a straightforward method to address class imbalance by artificially increasing the instances of minority classes. In its simplest form, it involves duplicating existing examples. This approach served as a baseline in the study, providing a reference point for evaluating linguistic diversity and semantic similarity, as it introduces no new linguistic variations but maintains maximum semantic fidelity.
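As a rough illustration of this baseline, the sketch below duplicates randomly chosen examples of each minority class until the class reaches a target size. It is a minimal sketch, not the paper's code: the function name `oversample`, the `target_count` parameter, and the single-label simplification (one label per text) are assumptions made for this example.

```python
import random
from collections import Counter

def oversample(texts, labels, minority_labels, target_count, seed=0):
    """Duplicate random examples of each minority label up to target_count.

    Assumes one label per text (a simplification for illustration).
    """
    rng = random.Random(seed)
    out_texts, out_labels = list(texts), list(labels)
    counts = Counter(labels)
    for label in minority_labels:
        # Existing examples of this class are the only source of new rows.
        pool = [text for text, lab in zip(texts, labels) if lab == label]
        if not pool:
            continue  # nothing to duplicate for this label
        for _ in range(max(0, target_count - counts[label])):
            out_texts.append(rng.choice(pool))
            out_labels.append(label)
    return out_texts, out_labels

texts = ["great news", "so happy", "love it", "I miss her"]
labels = ["joy", "joy", "joy", "grief"]
aug_texts, aug_labels = oversample(texts, labels, ["grief"], target_count=3)
print(Counter(aug_labels))
```

Because every added row is an exact copy, the set of distinct texts does not change, which is why this method adds no linguistic diversity.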
2. Paraphrasing: Rewording for Diversity
Paraphrasing involves rewriting or rephrasing text while preserving its original semantic meaning but altering its lexical and syntactic structures. This method is highly effective for data augmentation in NLP because it introduces lexical diversity without changing the core message. The study utilized GPT-3.5 and GPT-4 for paraphrasing, experimenting with different prompt configurations to generate single or multiple paraphrases per sample.
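The single-versus-multiple paraphrase setup can be sketched as follows. This is a hedged example only: the prompt wording is hypothetical and not taken from the paper, and `llm` stands for any caller-supplied function that maps a prompt string to a response string (for instance, a thin wrapper around a chat API), so no specific vendor API is assumed.

```python
import re
from typing import Callable, List

def build_paraphrase_prompt(text: str, n: int = 1) -> str:
    """Build a paraphrasing prompt (hypothetical wording, not the paper's)."""
    return (
        f"Paraphrase the following comment in {n} different way(s). "
        "Keep the meaning and the expressed emotion unchanged. "
        "Return one paraphrase per line, numbered 1., 2., ...\n\n"
        f"Comment: {text}"
    )

def parse_numbered_lines(response: str) -> List[str]:
    """Extract items from lines that start with '1.' or '1)' style numbering."""
    items = []
    for line in response.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+)$", line)
        if match:
            items.append(match.group(1).strip())
    return items

def paraphrase(text: str, llm: Callable[[str], str], n: int = 1) -> List[str]:
    """Request n paraphrases and drop any that merely repeat the input."""
    candidates = parse_numbered_lines(llm(build_paraphrase_prompt(text, n)))
    return [c for c in candidates if c.lower() != text.lower()][:n]
```

Filtering out outputs identical to the input keeps the method from silently degrading into oversampling when the model returns the original sentence.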
3. Zero-shot and Few-shot Generation: Learning with Limited Examples
Zero-shot learning (ZSL) and few-shot learning (FSL) describe how a model handles a task it was not explicitly trained for: in ZSL the model receives no examples of the task, while in FSL it receives a small number. Applied to LLM-based augmentation, this means prompting the model to write new samples for a target emotion either from an instruction alone (zero-shot) or from an instruction plus a handful of existing samples (few-shot), relying on in-context learning rather than additional training. Both are resource-efficient because they need little or no labeled data. The study used GPT-3.5 and GPT-4 in both zero-shot and 5-shot scenarios.
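In prompt terms, the two setups differ only in whether existing samples are included. The sketch below shows one way this could look; the prompt text, the function names, and the random selection of shots are assumptions for illustration, since the paper's exact prompts and example-selection strategy are not reproduced here.

```python
import random
from typing import List, Sequence

def build_generation_prompt(emotion: str,
                            examples: Sequence[str] = (),
                            n_outputs: int = 5) -> str:
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    lines = [
        f"Write {n_outputs} short Reddit-style comments "
        f"that express the emotion '{emotion}'."
    ]
    if examples:
        # Few-shot: show the model existing samples of the target class.
        lines.append("Here are some existing comments of this kind:")
        lines.extend(f"- {example}" for example in examples)
    lines.append("Return one comment per line.")
    return "\n".join(lines)

def sample_shots(pool: Sequence[str], k: int = 5, seed: int = 0) -> List[str]:
    """Pick up to k distinct in-context examples from the minority class."""
    rng = random.Random(seed)
    return rng.sample(list(pool), min(k, len(pool)))

pool = [f"comment {i}" for i in range(10)]
print(build_generation_prompt("grief"))                       # zero-shot
print(build_generation_prompt("grief", sample_shots(pool)))   # 5-shot
```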
4. Backtranslation: The Round-Trip Approach
Backtranslation (BT) is a specific form of paraphrasing where a source text is translated into another language and then translated back into the original language. The resulting text often exhibits lexical variations while maintaining semantic similarity. This process leverages the unique structures and linguistic features of different languages, and even imperfect translations can introduce valuable diversity. The researchers conducted backtranslation using multiple models, including DeepL, GPT-3.5, GPT-4, GPT-4-turbo, and the MarianMT model family, translating through various foreign languages like Russian, Polish, Chinese, and Spanish to maximize diversity.
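The round trip itself is simple to express. The following is a minimal sketch that stays independent of any particular translation service: `translate` is a caller-supplied function with the assumed signature `(text, source_lang, target_lang) -> str`, which could wrap DeepL, a GPT prompt, or a MarianMT model. The language codes and the deduplication step are illustrative choices, not details from the paper.

```python
from typing import Callable, List, Sequence

# Assumed signature: (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]

def backtranslate(text: str,
                  translate: Translator,
                  pivots: Sequence[str],
                  source_lang: str = "en") -> List[str]:
    """Translate text into each pivot language and back again.

    Returns only round-trip results that differ from the input and
    from each other (compared case-insensitively).
    """
    variants = []
    seen = {text.strip().lower()}
    for pivot in pivots:
        forward = translate(text, source_lang, pivot)
        back = translate(forward, pivot, source_lang)
        key = back.strip().lower()
        if key not in seen:
            seen.add(key)
            variants.append(back)
    return variants
```

Using several pivot languages gives several candidates per source sentence; a round trip that returns the input unchanged adds nothing beyond oversampling, so it is discarded here.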
Evaluating the Impact: Diversity, Fidelity, and Classification Performance
The evaluation of the augmented data focused on three aspects: linguistic diversity, semantic fidelity, and the improvement in classification performance. Linguistic diversity was measured with word count ratio, Jaccard dissimilarity, information entropy, and type-token ratio (TTR). Semantic fidelity, which is crucial for ensuring that both meaning and label are preserved, was assessed with cosine similarity between embeddings and with BERTScore. Finally, the impact on classification was measured by fine-tuning two popular transformer models, LaBSE and DistilBERT, and comparing their F1-macro scores on both the entire dataset and the augmented classes.
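To make these metrics concrete, here is a small sketch using their common textbook definitions over whitespace tokens. The paper's exact tokenization and formulas may differ, so treat the function names and details as assumptions; the embedding vectors passed to `cosine_similarity` would come from a sentence-embedding model, which is outside this example. All functions assume non-empty inputs.

```python
import math
from collections import Counter

def tokens(text):
    """Lowercased whitespace tokens (a simplifying assumption)."""
    return text.lower().split()

def word_count_ratio(original, augmented):
    """Length of the augmented text relative to the original, in words."""
    return len(tokens(augmented)) / len(tokens(original))

def jaccard_dissimilarity(original, augmented):
    """1 - |A ∩ B| / |A ∪ B| over the two vocabularies; higher = more diverse."""
    a, b = set(tokens(original)), set(tokens(augmented))
    return 1 - len(a & b) / len(a | b)

def type_token_ratio(text):
    """Distinct words divided by total words."""
    toks = tokens(text)
    return len(set(toks)) / len(toks)

def token_entropy(text):
    """Shannon entropy (in bits) of the token frequency distribution."""
    toks = tokens(text)
    total = len(toks)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(toks).values())

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

original = "i am so proud of you"
augmented = "i am really proud of you"
print(jaccard_dissimilarity(original, augmented))  # 1 - 5/7
```

The diversity metrics reward change while the fidelity metrics reward sameness, so a useful augmentation method has to score reasonably on both at once.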
Key Findings: What the Experiments Revealed
The experiments yielded several important insights. While paraphrasing methods, particularly with GPT-3.5, generated longer sentences and higher type-token ratios, GPT-4 introduced more lexical diversity as measured by Jaccard dissimilarity. Backtranslation techniques generally introduced less diversity, with the exception of the MarianMT models and GPT-4-turbo, which showed high diversity. In terms of semantic fidelity, backtranslation methods generally outperformed paraphrasing, with DeepL and GPT-4 achieving the highest results.
Crucially, all data augmentation methods led to a visible increase in classification performance for the augmented classes and for the entire dataset, without significantly decreasing performance on non-augmented classes. Even simple oversampling produced substantial gains, improving DistilBERT’s F1-macro by over 76% on the augmented classes and by 5% on the entire dataset. Paraphrasing methods showed increases of up to 84% on the augmented classes, while generative approaches such as 5-shot GPT-3.5 achieved a 77.5% increase with DistilBERT. However, backtranslation consistently produced the best overall results. Specifically, backtranslation using DeepL with DistilBERT led to a 121% increase in F1-macro for the augmented classes and a 7.7% increase across the entire dataset.
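Since F1 scores cannot exceed 1, gains of this size are best read as relative increases over the non-augmented baseline. The sketch below shows how F1-macro and such a relative gain can be computed; it assumes a single label per example for simplicity, and the scores in the usage lines are made-up numbers, not results from the paper.

```python
def f1_macro(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1; pass `labels` to score a subset of classes."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def relative_gain_percent(baseline, augmented):
    """Relative change of a score with respect to the baseline, in percent."""
    return 100 * (augmented - baseline) / baseline

y_true = ["joy", "joy", "grief", "grief"]
y_pred = ["joy", "joy", "joy", "grief"]
print(f1_macro(y_true, y_pred))                    # all classes
print(f1_macro(y_true, y_pred, labels=["grief"]))  # augmented classes only
print(relative_gain_percent(0.20, 0.30))           # hypothetical scores
```

Because every class counts equally in the macro average, improvements on a few rare classes can move the score for the augmented subset dramatically while the whole-dataset score shifts only by a few percent.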
The study concluded that while modern LLMs can effectively enhance traditional data augmentation approaches, the overall performance depends heavily on the specific models used for augmentation and on configuration details such as prompt design and example selection.
Looking Ahead: Limitations and Future Directions
The researchers acknowledge that while data augmentation with LLMs is effective for improving performance and mitigating class imbalance, further experiments are needed to confirm these findings across different models and datasets. Future research should explore various model parameters, evaluate generated text using LLM-as-a-judge approaches, and investigate specific backtranslation setups and multiple prompts for paraphrasing and generation. Addressing potential cultural and global biases resulting from data sourcing (e.g., Reddit) is also a critical area for future work.
Conclusion: The Power of Augmentation in the LLM Era
In summary, the study demonstrates that all explored data augmentation methods, when utilizing LLMs, produce semantically similar yet distinct samples. While paraphrasing offered greater lexical diversity, backtranslation generally maintained better semantic similarity. All methods successfully improved classification results, with backtranslation, particularly using the DeepL model, yielding the most significant gains in F1-macro scores for augmented classes. The research highlights the importance of considering the costs associated with data augmentation, including API usage fees, fine-tuning time, and computational resources. Ultimately, the findings underscore that modern language models can effectively leverage traditional data augmentation techniques to significantly enhance performance in NLP tasks, especially for addressing data scarcity and class imbalance.


