TLDR: A new research paper introduces a multidimensional framework to understand and control harmful content in LLMs. By training “linear probes” for 55 harmful subconcepts, the authors found that these concepts form a low-rank “harmfulness subspace.” Steering the model’s internal states away from the dominant direction of this subspace nearly eliminates harmful responses while preserving model utility.
Large language models (LLMs) have become an integral part of our daily lives, putting powerful capabilities in the hands of everyday users. However, this broader access and capability also bring risks, particularly around the generation of harmful content. Previous efforts to curb such behavior have focused on methods like direct preference optimization and safety fine-tuning. A new research paper, titled “Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing,” introduces a novel approach to understanding and mitigating harmfulness by examining the internal workings of these models.
The researchers, McNair Shah, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O’Brien, and Will Cai, propose a multidimensional framework to probe and steer harmful content within LLM internals. Their core idea revolves around identifying 55 distinct harmfulness subconcepts, such as racial hate, employment scams, and weapon-making. For each of these subconcepts, they developed a ‘linear probe,’ which essentially learns a specific direction in the model’s activation space. These 55 directions collectively form what they call a ‘harmfulness subspace.’
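To make the probing step concrete, here is a minimal sketch of how one such linear probe might be trained, assuming activations have already been collected from a chosen layer. The file names, the logistic-regression formulation, and the layer choice are illustrative assumptions, not the paper’s exact recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: hidden-state activations from one transformer layer,
# paired with binary labels marking whether each example expresses the
# subconcept being probed (e.g., racial hate). File names are illustrative.
X = np.load("activations_racial_hate.npy")  # shape (n_examples, d_model)
y = np.load("labels_racial_hate.npy")       # shape (n_examples,), in {0, 1}

probe = LogisticRegression(max_iter=1000).fit(X, y)

# The learned weight vector defines the probe's direction in activation
# space; normalizing it gives a unit vector for this subconcept.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

Repeating this for all 55 subconcepts yields 55 unit vectors, whose span is the harmfulness subspace described above.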
A key finding of their study is that this harmfulness subspace is remarkably ‘low-rank.’ This means that despite having 55 individual directions, the harmfulness can largely be represented by a much smaller number of underlying components, with a single ‘dominant direction’ capturing most of its structure. This discovery is crucial because it suggests that controlling harmfulness might not require manipulating all 55 individual directions, but rather focusing on this dominant one.
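A simple way to examine such a low-rank claim is to stack the probe directions and inspect their singular values. The snippet below is a sketch using a random stand-in for the real directions; in practice the matrix would be built from the 55 trained probes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 55 unit-norm probe directions; in practice these come
# from probes trained as sketched above (d_model = 4096 for Llama-3.1-8B).
D = rng.normal(size=(55, 4096))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# A steep drop-off in singular values indicates a low-rank subspace.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
explained = S**2 / np.sum(S**2)
print("Share of spectrum captured by the top component:", explained[0])

# The top right-singular vector serves as the dominant direction.
dominant_direction = Vt[0]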
To test their theory, the researchers performed two main types of interventions: ‘ablation’ and ‘steering.’ Ablation involves removing the harmfulness subspace or its dominant direction from the model’s internal states, effectively trying to erase the model’s capacity for harm. Steering, on the other hand, involves actively pushing the model’s internal states away from the dominant harmful direction. They conducted these experiments on LLAMA-3.1-8B-INSTRUCT and replicated some findings on QWEN-2-7B-INSTRUCT, using various datasets to evaluate both safety and utility.
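Both interventions can be understood as linear operations on hidden states. The following is an illustrative PyTorch sketch, not the authors’ code: `ablate` projects activations onto the orthogonal complement of the subspace, while `steer` subtracts a scaled copy of the dominant direction; the strength `alpha` and the choice of layers to hook are assumptions.

```python
import torch

def ablate(hidden: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Remove the harmfulness subspace by projecting hidden states onto
    the orthogonal complement of span(V).

    hidden: (..., d_model); V: (k, d_model) with orthonormal rows.
    """
    coeffs = hidden @ V.T          # coordinates in the subspace, (..., k)
    return hidden - coeffs @ V     # subtract the projection

def steer(hidden: torch.Tensor, v: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Push hidden states away from the dominant harmful direction v
    (a unit vector). The strength alpha is a hypothetical setting.
    """
    return hidden - alpha * v

# In practice, these transforms would be applied during generation via
# forward hooks registered on selected transformer layers.
```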
The results were promising. Steering the model away from the dominant direction of the harmfulness subspace nearly eliminated harmful responses on a challenging ‘jailbreak’ dataset designed to elicit unsafe outputs. Crucially, this significant improvement in safety came with only a minor decrease in the model’s overall utility, as measured by its performance on general knowledge tasks. Ablation also improved safety, but steering proved more effective at striking a strong balance between safety and utility.
The study also explored ‘token visualizations,’ which reveal which specific words or phrases trigger the harmfulness probes. While many triggers were contextually relevant to harmful subcategories, the researchers noted that some semantically unrelated or benign words also received high scores, highlighting the challenges of interpreting model behavior at a granular level. These ‘misfires’ underscore the need for careful human oversight in AI safety systems.
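Conceptually, such a visualization reduces to scoring each position’s hidden state against a probe direction. A minimal sketch, assuming per-token hidden states are available:

```python
import numpy as np

def token_scores(hidden_states: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each token's hidden state onto a probe direction.

    hidden_states: (seq_len, d_model); direction: (d_model,) unit vector.
    Returns one score per token; the highest-scoring tokens are the probe's
    'triggers', including the occasional benign misfire noted above.
    """
    return hidden_states @ direction

# Example usage (H and dominant_direction are assumed from earlier steps):
# scores = token_scores(H, dominant_direction)
# top_positions = np.argsort(scores)[::-1][:10]
```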
While the findings offer valuable insights and practical tools for enhancing LLM safety, the authors acknowledge several limitations. The generalizability of their results to models of other scales needs further investigation. The 55 harmfulness subcategories, while extensive, may not cover all possible forms of harmful content. Additionally, the definition of ‘utility’ used in their evaluation is specific and may not encompass all aspects of a model’s usefulness. Nevertheless, this work significantly advances the understanding of how harmfulness is represented within LLMs and provides a scalable approach to auditing and hardening future generations of language models. For more details, refer to the full research paper.


