Making LLMs Safer and More Helpful: Insights from Fine-Tuning OPT-350M

TLDR: A study by Piyush Pant investigated Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach to improve the safety and helpfulness of the OPT-350M language model. Using the Anthropic Helpful-Harmless RLHF dataset, the research found that while SFT alone outperformed DPO, the SFT+DPO combination yielded the best results across all metrics (Harmlessness Rate, Helpfulness Rate, and Combined Alignment Score). The study highlighted that DPO’s standalone performance was impacted by noisy data and training constraints, but it proved highly effective as a complementary step after SFT, suggesting a hybrid approach is optimal for robust LLM alignment.

Large Language Models (LLMs) have become incredibly powerful tools, driving advancements in everything from conversational AI to creative writing. However, ensuring these models are both safe and helpful remains a significant challenge. Unchecked, LLMs can produce content that is incorrect, biased, toxic, or even harmful. This has led researchers to explore various alignment techniques to guide LLMs towards more desirable behaviors.

A recent study, “Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M,” by Piyush Pant, delves into the effectiveness of two prominent alignment methods: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), as well as a combination of both. The research specifically focuses on the OPT-350M language model, a smaller yet representative model, making the findings accessible and relevant for settings with limited computational resources. You can read the full research paper here: Improving LLM Safety and Helpfulness using SFT and DPO.

Understanding Alignment Techniques

Traditionally, Reinforcement Learning from Human Feedback (RLHF) has been a standard for aligning LLMs. However, RLHF can be computationally intensive and complex. To address these limitations, Direct Preference Optimization (DPO) emerged as a promising alternative. DPO directly optimizes a model using ranked human preferences, eliminating the need for an explicit reward model or a complex reinforcement learning loop. It adjusts model parameters to increase the probability of preferred responses over non-preferred ones, offering a simpler implementation with competitive results.
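To make the idea concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not taken from the study's code: the function name, the argument names, and the default beta of 0.1 are illustrative assumptions. The inputs are summed log-probabilities of the preferred ("chosen") and non-preferred ("rejected") responses under the model being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch, beta is an assumed default)."""
    # How much more likely each response became relative to the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the trained model and the reference model agree, the loss equals log 2; it falls as the model shifts probability toward the preferred response and rises if it shifts toward the non-preferred one. No reward model or sampling loop appears anywhere in the computation.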

Supervised Fine-Tuning (SFT), on the other hand, is a more foundational technique. It involves directly training models on labeled data, where the model learns to produce specific, desired outputs. While effective for encoding safe and helpful responses, SFT doesn’t inherently handle nuanced preferences or trade-offs between helpfulness and harmlessness as directly as preference-based methods.
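As a rough illustration of what "training on labeled data" means here, SFT typically minimizes the average negative log-likelihood of the tokens in the desired response. The sketch below assumes a hypothetical helper, sft_loss, that takes the probabilities the model assigned to each target token; it is not the study's implementation.

```python
import math

def sft_loss(target_token_probs):
    """Mean negative log-likelihood of the desired response's tokens (illustrative sketch)."""
    if not target_token_probs:
        raise ValueError("need at least one target token probability")
    # Each entry is the probability the model gave to the correct next token.
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)
```

The loss is zero only when the model assigns probability 1 to every target token. Note that it looks at a single desired response at a time, which is why it carries no direct signal about which of two candidate responses is better.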

The Study’s Approach

The research utilized the Anthropic Helpful-Harmless RLHF dataset, a comprehensive collection designed for alignment training. Four versions of the OPT-350M model were evaluated: the base model (without any alignment tuning), an SFT-trained model, a DPO-trained model, and a model that first underwent SFT and then DPO (SFT+DPO). This multi-phase design allowed for a thorough comparison of each technique’s standalone and combined effectiveness.

To assess performance, the study introduced three evaluation metrics derived from a dedicated reward model: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS). HmR measures the proportion of harmless responses to harmful prompts, HpR measures the proportion of helpful responses to benign queries, and CAS summarizes overall alignment quality by averaging HmR and HpR.
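The three metrics can be sketched in a few lines of Python. The averaging step for CAS follows the description above; everything else is an assumption made for illustration, in particular the function names and the rule that a response counts as harmless or helpful when its reward score exceeds a threshold of 0.0. The study's actual scoring rule may differ.

```python
def rate(scores, threshold=0.0):
    """Fraction of reward scores above the (assumed) acceptance threshold."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s > threshold for s in scores) / len(scores)

def alignment_metrics(harmful_prompt_scores, benign_prompt_scores, threshold=0.0):
    """Compute HmR, HpR, and CAS from reward-model scores (illustrative sketch)."""
    hmr = rate(harmful_prompt_scores, threshold)  # harmless responses to harmful prompts
    hpr = rate(benign_prompt_scores, threshold)   # helpful responses to benign queries
    return {"HmR": hmr, "HpR": hpr, "CAS": (hmr + hpr) / 2}
```

For example, alignment_metrics([1.0, -0.5, 0.3, 2.0], [0.2, -1.0]) gives an HmR of 0.75, an HpR of 0.5, and a CAS of 0.625.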

Key Findings

The results highlighted several important insights:

The base OPT-350M model showed the weakest overall alignment, particularly in helpfulness.
The SFT model showed significant improvements in both harmlessness and helpfulness, demonstrating a balanced alignment.
The DPO model, while improving helpfulness, surprisingly showed a slight decrease in harmlessness compared to the base model. This was attributed to factors like noise in the dataset, low-quality base responses, and computational constraints (DPO was trained for only one epoch using Parameter-Efficient Fine-Tuning, while SFT was trained for two full epochs).
Crucially, the combined SFT+DPO model emerged as the top performer. It achieved the highest helpfulness rate and the best Combined Alignment Score, showcasing that DPO can provide significant value when applied as a second-stage optimization on an already SFT-aligned model. This suggests a complementary relationship between the two techniques.

The study also provided a detailed analysis of reward score distributions, showing that SFT and SFT+DPO models produced more consistent and higher-quality responses compared to the base and standalone DPO models. The SFT+DPO model, in particular, exhibited a more reliable and stable alignment performance.

Implications for Future LLM Development

This research underscores the importance of robust alignment strategies, especially for smaller LLMs that are often deployed by startups or research groups with limited resources. While DPO offers a promising path, its effectiveness depends on data quality and training budget. The findings suggest that a hybrid SFT+DPO approach could be the most effective pipeline for enhancing both the safety and helpfulness of language models. Future work will likely focus on addressing data noise, optimizing training durations, and testing whether these combined methods scale to larger LLMs.
