TLDR: A new research paper challenges the common belief that fine-tuning inevitably harms LLM safety. Instead, it argues that poor optimization choices are often the cause. By carefully adjusting training hyperparameters like learning rate and batch size, and by introducing an Exponential Moving Average (EMA) momentum technique, the researchers significantly reduced unsafe model responses (from 16% to as low as 3%) while maintaining performance. This approach avoids the need for additional safety datasets and offers practical guidelines for safer LLM adaptation.
Large Language Models (LLMs) have become incredibly powerful, adapting to many tasks through a process called fine-tuning. While this customization boosts performance for specific applications, it often raises concerns about safety. It’s commonly believed that fine-tuning, even with harmless datasets, can unintentionally make LLMs generate harmful responses, requiring extra safety measures.
However, new research challenges this idea. A paper titled “Rethinking Safety in LLM Fine-tuning: An Optimization Perspective” suggests that safety problems aren’t an unavoidable trade-off, but rather often stem from poor optimization choices during the training process. By carefully selecting key training settings, such as the learning rate, batch size, and gradient steps, the researchers found they could significantly reduce unsafe model responses while maintaining the model’s overall usefulness.
The Role of Optimization
The study demonstrates that by simply adjusting these training hyperparameters, the rate of unsafe responses to harmful prompts could be dramatically cut, for example, from 16% down to around 5%. This indicates that the ‘catastrophic forgetting’ of safety knowledge during fine-tuning is far less severe than previously thought, especially when the fine-tuning process is optimized effectively.
The researchers propose that stable learning is crucial for preserving the safety behavior the original model acquired before fine-tuning. They observed that aggressive parameter updates, often caused by choices such as an overly high learning rate or a small batch size, can push the model out of its ‘safety basin’: a stable region of parameter space in which the model retains its safety knowledge.
Introducing EMA Momentum
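To keep fine-tuning updates stable, the paper introduces a simple yet effective technique: Exponential Moving Average (EMA) momentum in the parameter space. Instead of relying only on the latest weights, the method maintains a running average of the parameters that stays anchored to the original pre-trained model, so the knowledge in that model continues to influence the result throughout fine-tuning. By smoothing the parameter updates, EMA prevents abrupt shifts that could compromise safety. Notably, this EMA-based approach achieved an even lower attack success rate (the share of harmful prompts that elicit an unsafe response) of approximately 3%, without needing any additional safety-specific data.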
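The smoothing idea can be illustrated with a minimal sketch of a generic parameter-space EMA. This is not the paper's implementation: the function names `init_ema` and `ema_update`, the `decay` value of 0.9, and the toy two-element weight vector are all assumptions made for illustration, and the paper's exact update rule and schedule may differ.

```python
from typing import Dict, List

# Hypothetical stand-in for model weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def init_ema(pretrained: Params) -> Params:
    """Start the EMA copy from the pre-trained weights (copied, not aliased)."""
    return {name: list(values) for name, values in pretrained.items()}


def ema_update(ema: Params, current: Params, decay: float = 0.9) -> None:
    """Update in place: ema <- decay * ema + (1 - decay) * current."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    for name, values in current.items():
        ema[name] = [
            decay * e + (1.0 - decay) * c
            for e, c in zip(ema[name], values)
        ]


# Toy example: the weights jump far from the pre-trained point in one step.
pretrained = {"w": [1.0, -2.0]}
ema = init_ema(pretrained)

finetuned = {"w": [3.0, 0.0]}  # pretend result of an aggressive update
ema_update(ema, finetuned, decay=0.9)

# The averaged weights move only a tenth of the way toward the new point.
print(ema["w"])
```

With a decay close to 1, the averaged weights change slowly and stay near the starting point, which is the intuition behind remaining inside the ‘safety basin’. A smaller decay tracks the fine-tuned weights more closely and smooths less.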
The experiments were conducted on models from the Llama families, across the Dolly, Alpaca, and ORCA datasets. The results consistently showed that safety issues during fine-tuning can largely be avoided through proper optimization choices, and reduced further by the EMA method. This approach outperformed existing methods that typically require extra safety datasets, offering practical guidance for maintaining both model performance and safety during adaptation.
The research highlights that the utility loss landscape (how well the model performs its task) is generally smoother and wider than the safety loss landscape (how well it avoids harmful responses). This means that safety knowledge is more sensitive to suboptimal optimization. The EMA method helps the model navigate this complex landscape more stably, preserving safety without sacrificing performance.
This work suggests that optimization strategy is a practical lever for building safer and more reliable LLMs, one that does not depend on collecting additional safety data. The full research paper is titled “Rethinking Safety in LLM Fine-tuning: An Optimization Perspective.”


