TLDR: A study using activation patching on the LLaMA 3.2 1B Instruct model reveals how language models process personas. It found that early Multi-Layer Perceptron (MLP) layers encode persona-specific semantic information, which is then utilized by middle Multi-Head Attention (MHA) layers to shape the model’s output. The research also identified specific attention heads that disproportionately focus on racial and color-based identities, offering insights into the internal mechanisms behind persona-driven reasoning and potential biases.
Large language models (LLMs) have shown an impressive ability to adopt various personas, allowing them to generate responses that are sensitive to context and tailored to specific roles. However, this ability isn’t without its complexities; assigning a persona can sometimes influence the model’s reasoning on objective tasks and, in certain instances, even amplify existing social biases.
A recent study delves into the internal workings of these models to understand the causal mechanisms behind persona-driven behavior. Using a technique called activation patching, researchers examined how key components within a pre-trained language model encode and utilize persona-specific information.
Unpacking the Model’s Layers
The study focused on two primary component types: Multi-Layer Perceptron (MLP) layers and Multi-Head Attention (MHA) layers, along with the individual attention heads inside the latter. Contrary to previous assumptions that early MLP layers primarily handle syntactic structure, this research reveals that these layers also process the semantic content of the input. Essentially, they transform persona tokens (like ‘Asian’ or ‘good’ in ‘Asian student’ or ‘good student’) into richer, more meaningful representations.
These enriched representations are then passed on to the middle MHA layers, which use the persona-specific information to shape the model’s final output. The findings suggest a clear division of labor: early MLP layers build the persona’s semantic foundation, and middle MHA layers leverage it to influence the model’s responses.
Identifying Bias in Attention
A particularly significant discovery was the identification of specific attention heads that disproportionately focus on racial and color-based identities. While the study found that personas with negative attributes (like ‘bad student’) consistently led to worse performance, the patterns for racial or color-coded personas were less clear-cut, indicating a complex interplay of factors.
The researchers employed a method called “de-noising” activation patching, in which activations at selected components of a ‘corrupted’ run (e.g., with a biased persona) are replaced with the corresponding activations from a ‘clean’ run (e.g., with a neutral persona). This allowed them to pinpoint which components were responsible for changes in the model’s output. They observed that patching early MLP layers (layers 1-3) and middle MHA layers (layers 9-11) had the most consistent impact across different persona pairings.
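To make the procedure concrete, here is a minimal sketch of de-noising activation patching using PyTorch forward hooks on the Hugging Face checkpoint. It is a rough illustration under stated assumptions, not the paper’s code: the prompts, the choice of patching the first three MLP layers, and the next-token-logit comparison are all illustrative, and the `model.model.layers[i].mlp` module path assumes a recent `transformers` release of the LLaMA architecture.

```python
# Minimal de-noising activation-patching sketch (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Persona pair (illustrative): a positive-attribute 'clean' prompt and a
# racial-persona 'corrupted' prompt. Patching assumes the two prompts have
# the same token length, so activations can be swapped position-for-position.
clean_prompt = "You are a good student. Answer the question."
corrupted_prompt = "You are an Asian student. Answer the question."
clean_ids = tok(clean_prompt, return_tensors="pt")
corrupted_ids = tok(corrupted_prompt, return_tensors="pt")
assert clean_ids["input_ids"].shape == corrupted_ids["input_ids"].shape

# The first three decoder layers (the paper's "layers 1-3", assuming 1-based counting).
layers_to_patch = [0, 1, 2]

# 1) Clean run: cache the MLP output of each layer of interest.
clean_cache = {}
def cache_hook(idx):
    def hook(module, inputs, output):
        clean_cache[idx] = output.detach().clone()  # [batch, seq_len, hidden_dim]
    return hook

handles = [model.model.layers[i].mlp.register_forward_hook(cache_hook(i))
           for i in layers_to_patch]
with torch.no_grad():
    model(**clean_ids)
for h in handles:
    h.remove()

# 2) Corrupted run, with the cached clean activations patched back in.
def patch_hook(idx):
    def hook(module, inputs, output):
        return clean_cache[idx]  # returning a tensor replaces the module's output
    return hook

handles = [model.model.layers[i].mlp.register_forward_hook(patch_hook(i))
           for i in layers_to_patch]
with torch.no_grad():
    patched_logits = model(**corrupted_ids).logits
for h in handles:
    h.remove()

# 3) Baseline corrupted run (no patching) for comparison: how far does the
#    patch move the next-token distribution back toward the clean behavior?
with torch.no_grad():
    corrupted_logits = model(**corrupted_ids).logits
print((patched_logits[0, -1] - corrupted_logits[0, -1]).abs().max())
```

The same scaffolding extends to the middle MHA layers by hooking `self_attn` instead of `mlp`, with the caveat that attention modules typically return a tuple rather than a single tensor.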
Further investigation showed that patching only the identity token position in the very first MLP layer produced an effect nearly equivalent to patching all token positions, highlighting the critical role of these initial layers in establishing persona semantics. And when activations for racial or color-based personas were replaced with those from personas carrying positive or negative attributes, the attention that the previously identified heads paid to identity tokens dropped significantly.
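The position-restricted experiment needs only a small change to the patch hook above (this snippet reuses `model`, `tok`, `clean_cache`, and `corrupted_ids` from the previous sketch): instead of overwriting the whole activation tensor, only the identity token’s position is swapped. The hard-coded position index is an illustrative assumption; in practice it would be located from the tokenized prompt.

```python
# Variant of the patch hook: patch only the identity-token position in the
# very first MLP layer (index 0 here; the paper's layer numbering may differ).
identity_pos = 4  # position of the identity token (e.g., 'Asian') -- illustrative;
                  # in practice, find it by inspecting tok(corrupted_prompt).input_ids

def patch_identity_position_hook(idx):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, identity_pos, :] = clean_cache[idx][:, identity_pos, :]
        return patched
    return hook

handle = model.model.layers[0].mlp.register_forward_hook(patch_identity_position_hook(0))
with torch.no_grad():
    single_position_logits = model(**corrupted_ids).logits
handle.remove()
```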
Implications for Understanding LLMs
This research, conducted on the LLaMA 3.2 1B Instruct model using the MMLU benchmark, provides preliminary insights into the subtle yet significant functions of various model components. It challenges existing assumptions about how LLMs process information and lays the groundwork for future efforts to mitigate deep-seated biases in these systems. By understanding where and how persona-driven behavior originates, researchers can work toward building more robust and fair AI models.
For more detailed information, you can refer to the full research paper: Dissecting Persona-Driven Reasoning in Language Models via Activation Patching.


