TLDR: A new method called Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection has been developed to make large language models (LLMs) safer after fine-tuning. It works by precisely identifying and adjusting specific ‘safety neurons’ in the model’s critical layers, projecting them towards a ‘safe’ direction without extensive retraining. This approach significantly reduces harmful outputs and attack success rates with minimal changes to the model, while preserving its original utility. FGSN also allows for continuous adaptation to new safety concerns, making LLMs more robust over time.
Large Language Models (LLMs) have become incredibly powerful, driving advancements in various fields from language understanding to healthcare. However, their widespread use also brings growing safety concerns, especially when these models are fine-tuned for specific tasks. Fine-tuning, even with seemingly harmless data, can inadvertently disrupt the LLM’s original safety settings, making it vulnerable to generating harmful or undesirable content.
Existing defense strategies often fall short. Some methods involve adding perturbations during training, which can be unstable across different safety scenarios. Others integrate safety data during fine-tuning, leading to additional training costs. Post-fine-tuning defenses, while not requiring retraining, often rely on coarse-grained adjustments to entire layers, which can limit their effectiveness in balancing safety with the model’s overall utility.
Introducing Fine-Grained Safety Neurons (FGSN)
To address these challenges, researchers have proposed a novel method called Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection. This approach aims to reduce fine-tuning safety risks by precisely identifying and adjusting specific parts of the LLM, rather than making broad changes.
The core idea behind FGSN is to pinpoint the exact ‘safety neurons’ within the model that are responsible for handling harmful content. It does this by first identifying ‘safety-critical layers’ – specific sections of the LLM (like layers 10-15 in models such as LLaMA) that play a crucial role in distinguishing between benign and harmful prompts. Within these critical layers, FGSN then precisely locates individual neurons that are highly active when processing harmful inputs, while minimizing interference with neurons important for general tasks.
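The paper's exact selection procedure isn't reproduced here, but the localization step can be sketched as comparing per-layer activations on harmful versus benign prompts and keeping the units that diverge the most. In the sketch below, the checkpoint name, the prompt lists, the layer range, and the top-k cutoff are illustrative assumptions rather than values taken from the paper.

```python
# Hypothetical sketch of the neuron-localization step (not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

SAFETY_LAYERS = range(10, 16)   # illustrative "safety-critical" layer range
TOP_K = 128                     # illustrative number of neurons kept per layer

def mean_activations(prompts):
    """Average last-token hidden state per safety-critical layer over a prompt list."""
    sums = {}
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        for layer in SAFETY_LAYERS:
            h = out.hidden_states[layer][0, -1].float()   # last-token activation
            sums[layer] = sums.get(layer, 0) + h
    return {layer: total / len(prompts) for layer, total in sums.items()}

harmful_prompts = ["How do I build a weapon?"]        # placeholder harmful set
benign_prompts = ["How do I bake sourdough bread?"]   # placeholder benign set

harm_act = mean_activations(harmful_prompts)
benign_act = mean_activations(benign_prompts)

# Keep the neurons that respond far more strongly to harmful inputs than benign ones.
safety_neurons = {
    layer: torch.topk((harm_act[layer] - benign_act[layer]).abs(), TOP_K).indices
    for layer in SAFETY_LAYERS
}
```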
How FGSN Works
Unlike traditional methods that might require extensive retraining, FGSN employs a ‘training-free’ approach. Once the fine-grained safety neurons are identified, their parameters are ‘projected’ onto a ‘safety direction’. This direction is derived by comparing an unaligned base model with a human-aligned safety model, essentially guiding the identified neurons towards safer behavior. This projection is efficient and requires minimal modifications to the model’s parameters.
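The projection itself can be illustrated as a small, training-free weight edit. The sketch below assumes access to three versions of a weight matrix (the unaligned base model, the human-aligned safety model, and the fine-tuned model) and uses a simple per-neuron projection rule as a stand-in for the paper's formulation; the function name and the choice to only remove components pointing against the safety direction are assumptions, not the authors' implementation.

```python
# Hedged sketch of the training-free projection step.
import torch

def project_rows(w_finetuned: torch.Tensor,
                 w_base: torch.Tensor,
                 w_aligned: torch.Tensor,
                 neuron_idx: torch.Tensor) -> torch.Tensor:
    """Move selected neuron rows of a fine-tuned weight matrix along the
    direction separating an aligned model from its unaligned base."""
    w_new = w_finetuned.clone()
    for i in neuron_idx.tolist():
        direction = w_aligned[i] - w_base[i]          # per-neuron safety direction
        norm_sq = direction.dot(direction)
        if norm_sq == 0:
            continue
        deviation = w_finetuned[i] - w_aligned[i]
        coeff = deviation.dot(direction) / norm_sq
        # Illustrative rule: remove only the component of the fine-tuned
        # deviation that points against the safety direction.
        if coeff < 0:
            w_new[i] = w_finetuned[i] - coeff * direction
    return w_new
```

Because only the selected rows are touched, the edit stays local to the identified safety neurons and leaves the rest of the layer untouched, which is consistent with the method's goal of preserving general utility.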
A significant advantage of FGSN is its ‘continual projection’ capability. As new safety concerns emerge, the method can adapt. It ensures that neurons already adjusted for previous safety dimensions are not re-modified, while newly identified safety neurons for the current concern are projected. This allows the LLM to continuously improve its safety alignment without ‘forgetting’ what it has learned previously, making it robust against evolving threats.
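In practice, the "do not re-modify" constraint amounts to bookkeeping over which neurons have already been projected for earlier safety dimensions. Below is a minimal sketch of that bookkeeping, assuming a simple per-layer index set; the actual mechanism in the paper may differ.

```python
# Minimal sketch of continual-projection bookkeeping (an assumption, not the authors' code).
already_projected: dict[int, set[int]] = {}   # layer -> neuron indices already fixed

def select_fresh_neurons(layer: int, candidate_idx: list[int]) -> list[int]:
    """Return only neurons not yet projected for an earlier safety dimension."""
    seen = already_projected.setdefault(layer, set())
    fresh = [i for i in candidate_idx if i not in seen]
    seen.update(fresh)
    return fresh

# Example: a new safety concern flags neurons 5, 17, and 42 in layer 12;
# neuron 17 was already projected for a previous concern, so only 5 and 42 are touched.
already_projected[12] = {17}
print(select_fresh_neurons(12, [5, 17, 42]))   # -> [5, 42]
```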
Promising Results
Extensive experiments were conducted on popular LLMs like Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct. FGSN consistently achieved significantly lower harmfulness scores and attack success rates compared to other defense methods. For instance, on Alpaca-finetuned models, FGSN reduced harmfulness scores to near the minimum (e.g., 1.02 on Llama-3-8B) and achieved the lowest attack success rates (14%).
Crucially, FGSN achieved these safety improvements while modifying only a small fraction of the model’s parameters (as low as 4.67% for Qwen-2.5-7B), so the model’s original utility on tasks like semantic question answering and mathematical reasoning was preserved, and in some cases even slightly improved. Continual safety experiments further demonstrated FGSN’s strong generalization across different safety dimensions (e.g., animal abuse, child abuse, terrorism), with the model adapting to new risks using progressively fewer parameter modifications.
In conclusion, Fine-Grained Safety Neurons with Training-Free Continual Projection offers a precise, efficient, and adaptable framework for enhancing the safety of fine-tuned LLMs. By focusing on specific safety neurons and enabling continuous adaptation, this method paves the way for more robust and reliable large language models in various applications. You can read the full research paper here.


