TLDR: A study by Piyush Pant investigated Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach to improve the safety and helpfulness of the OPT-350M language model. Using the Anthropic Helpful-Harmless RLHF dataset, the research found that while SFT alone outperformed DPO, the SFT+DPO combination yielded the best results across all metrics (Harmlessness Rate, Helpfulness Rate, and Combined Alignment Score). The study highlighted that DPO’s standalone performance was impacted by noisy data and training constraints, but it proved highly effective as a complementary step after SFT, suggesting a hybrid approach is optimal for robust LLM alignment.
Large Language Models (LLMs) have become incredibly powerful tools, driving advancements in everything from conversational AI to creative writing. However, ensuring these models are both safe and helpful remains a significant challenge. Unchecked, LLMs can produce content that is incorrect, biased, toxic, or even harmful. This has led researchers to explore various alignment techniques to guide LLMs towards more desirable behaviors.
A recent study, “Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M,” by Piyush Pant, examines the effectiveness of two prominent alignment methods, Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), as well as a combination of both. The research focuses on the OPT-350M language model, a smaller yet representative model, which makes the findings accessible and relevant for settings with limited computational resources.
Understanding Alignment Techniques
Traditionally, Reinforcement Learning from Human Feedback (RLHF) has been a standard for aligning LLMs. However, RLHF can be computationally intensive and complex. To address these limitations, Direct Preference Optimization (DPO) emerged as a promising alternative. DPO directly optimizes a model using ranked human preferences, eliminating the need for an explicit reward model or a complex reinforcement learning loop. It adjusts model parameters to increase the probability of preferred responses over non-preferred ones, offering a simpler implementation with competitive results.
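To make the idea of "increasing the probability of preferred responses over non-preferred ones" concrete, here is a minimal sketch of the per-pair DPO loss as it is usually written in the DPO literature. It is not code from the study. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are taken to be summed log-probabilities of each response under the model being trained and under a frozen reference copy of it.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    All arguments are log-probabilities of a full response.
    beta (an assumed value here) controls how far the policy may drift
    from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical numbers: the policy already prefers the chosen response
# slightly more than the reference does, so the loss falls below log(2).
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss shrinks as the policy raises the preferred response's likelihood relative to the rejected one, which is why no separate reward model or reinforcement learning loop is needed.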
Supervised Fine-Tuning (SFT), on the other hand, is a more foundational technique. It involves directly training models on labeled data, where the model learns to produce specific, desired outputs. While effective for encoding safe and helpful responses, SFT doesn’t inherently handle nuanced preferences or trade-offs between helpfulness and harmlessness as directly as preference-based methods.
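As a rough illustration of what "training on labeled data" means for SFT, the sketch below computes the standard objective: the average negative log-likelihood the model assigns to the tokens of a reference response. The function name and the sample probabilities are hypothetical and not taken from the study.

```python
import math


def sft_loss(target_token_probs):
    """Mean negative log-likelihood of the reference response's tokens.

    target_token_probs: probability the model assigned to each correct
    next token of the labeled response (values in (0, 1]).
    """
    if not target_token_probs:
        raise ValueError("need at least one target token")
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


# Hypothetical values: a model that is fairly confident about each
# target token gets a low loss; a less confident one gets a higher loss.
confident = sft_loss([0.9, 0.8, 0.95])
uncertain = sft_loss([0.2, 0.1, 0.3])
```

Note that this objective sees only the desired response. Nothing in it compares a better answer against a worse one, which is the gap preference-based methods such as DPO are meant to fill.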
The Study’s Approach
The research utilized the Anthropic Helpful-Harmless RLHF dataset, a comprehensive collection designed for alignment training. Four versions of the OPT-350M model were evaluated: the base model (without any alignment tuning), an SFT-trained model, a DPO-trained model, and a model that first underwent SFT and then DPO (SFT+DPO). This multi-phase design allowed for a thorough comparison of each technique’s standalone and combined effectiveness.
To assess performance, the study introduced three evaluation metrics derived from a dedicated reward model: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS). HmR measures the proportion of harmless responses to harmful prompts, HpR measures the proportion of helpful responses to benign queries, and CAS summarizes overall alignment quality by averaging HmR and HpR.
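The three metrics can be sketched in a few lines. The averaging step for CAS follows the description above. How a reward score is turned into a harmless or helpful verdict is not specified here, so the `threshold=0.0` cutoff, the function names, and the sample scores below are assumptions made purely for illustration.

```python
def rate(reward_scores, threshold=0.0):
    """Fraction of responses whose reward score exceeds the threshold."""
    if not reward_scores:
        raise ValueError("need at least one score")
    return sum(1 for s in reward_scores if s > threshold) / len(reward_scores)


def alignment_metrics(harmful_prompt_scores, benign_prompt_scores, threshold=0.0):
    """Compute HmR, HpR, and CAS from reward-model scores.

    harmful_prompt_scores: scores for responses to harmful prompts
    benign_prompt_scores:  scores for responses to benign queries
    threshold:             assumed cutoff for an acceptable response
    """
    hmr = rate(harmful_prompt_scores, threshold)  # Harmlessness Rate
    hpr = rate(benign_prompt_scores, threshold)   # Helpfulness Rate
    cas = (hmr + hpr) / 2                         # Combined Alignment Score
    return {"HmR": hmr, "HpR": hpr, "CAS": cas}


# Hypothetical scores: 3 of 4 responses to harmful prompts pass,
# 1 of 2 responses to benign queries pass.
metrics = alignment_metrics([1.0, -1.0, 2.0, 0.5], [1.0, -2.0])
# metrics == {"HmR": 0.75, "HpR": 0.5, "CAS": 0.625}
```

Because CAS is a plain average, a model can raise it by improving either rate, so it is worth reading HmR and HpR alongside it rather than relying on CAS alone.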
Key Findings
The results highlighted several important insights:
- The base OPT-350M model was the weakest overall, particularly in helpfulness.
- The SFT model showed significant improvements in both harmlessness and helpfulness, demonstrating a balanced alignment.
- The DPO model, while improving helpfulness, surprisingly showed a slight decrease in harmlessness compared to the base model. This was attributed to factors like noise in the dataset, low-quality base responses, and computational constraints (DPO was trained for only one epoch using Parameter-Efficient Fine-Tuning, while SFT was trained for two full epochs).
- Crucially, the combined SFT+DPO model emerged as the top performer. It achieved the highest helpfulness rate and the best Combined Alignment Score, showcasing that DPO can provide significant value when applied as a second-stage optimization on an already SFT-aligned model. This suggests a complementary relationship between the two techniques.
The study also provided a detailed analysis of reward score distributions, showing that SFT and SFT+DPO models produced more consistent and higher-quality responses compared to the base and standalone DPO models. The SFT+DPO model, in particular, exhibited a more reliable and stable alignment performance.
Implications for Future LLM Development
This research underscores the importance of robust alignment strategies, especially for smaller LLMs that are often deployed by startups or research groups with limited resources. While DPO offers a promising path, its effectiveness can be influenced by data quality and training budget. The findings strongly suggest that a hybrid SFT+DPO approach could be the most effective pipeline for enhancing both the safety and helpfulness of language models. Future work will likely focus on addressing data noise, optimizing training durations, and exploring the scalability of these combined methods to even larger LLMs.


