TLDR: A new research paper investigates whether biased decisions in large language models (LLMs) stem from biased internal thoughts. The study found that bias in an LLM’s thinking steps is often not strongly correlated with bias in its output, unlike in humans, whose biased decisions often stem from biased thinking. It also showed that Chain-of-Thought prompting has a varied, model-dependent impact on fairness, while injecting unbiased thoughts into the model’s prompt can effectively reduce output bias, offering a new mitigation strategy.
Large language models (LLMs) have revolutionized many aspects of natural language processing, showcasing remarkable capabilities in various tasks. However, their widespread deployment faces a significant hurdle: the presence of social biases. These biases, based on factors like gender, race, socio-economic status, and sexual orientation, can lead to unfair or discriminatory responses, raising serious ethical concerns.
A recent study delves into a fascinating question: Do biased models actually have biased thoughts? This research explores the internal reasoning processes of LLMs, specifically focusing on how Chain-of-Thought (CoT) prompting affects fairness. CoT prompting is a technique where models are asked to “think step-by-step” before providing a final answer, offering insights into their decision-making process.
The paper, titled “Do Biased Models Have Biased Thoughts?”, reports experiments on five popular large language models, covering 11 different types of bias measured with established fairness metrics. The goal was to quantify bias not just in the models’ final outputs, but also in their intermediate “thoughts” or reasoning steps.
Unpacking the Findings: Biased Thoughts vs. Biased Outputs
One of the most surprising findings of the study is that the bias observed in the models’ thinking steps is not strongly correlated with the bias in their final outputs. In most cases, the correlation coefficient was below 0.6, and the result was highly statistically significant. This suggests a crucial difference between how humans and these models exhibit bias: for humans, biased decisions often stem from biased thought processes, but for the tested LLMs, a biased decision does not necessarily mean that the reasoning leading to it was also biased.
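To make the idea of correlating thought bias with output bias concrete, here is a minimal Python sketch. It assumes each example has one numeric bias score for the model’s thoughts and one for its final output, and it uses the Pearson coefficient as the correlation measure. The summary above does not say which coefficient the paper uses, so both that choice and the scores below are illustrative assumptions, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-example scores, for illustration only (not data from the paper).
thought_bias = [0.10, 0.40, 0.35, 0.80, 0.20, 0.60]
output_bias = [0.30, 0.20, 0.50, 0.60, 0.40, 0.30]

r = pearson(thought_bias, output_bias)
print(f"correlation between thought bias and output bias: r = {r:.2f}")
```

A coefficient close to 1 would mean that biased thoughts and biased outputs tend to occur together; the study reports values that mostly stay below 0.6.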
To arrive at this conclusion, the researchers had to first figure out how to measure bias in these internal thoughts. They proposed six different methods, including repurposing existing techniques and introducing a novel approach called Bias Reasoning Analysis using Information Norms (BRAIN). These methods assess thought bias using various signals, such as model probabilities, LLM-as-a-judge evaluations, natural language inference, and semantic similarity. Both BRAIN and the LLM-as-a-judge method proved to be effective in detecting bias within the models’ thoughts.
The Impact of Step-by-Step Thinking and Thought Injection
The study also investigated whether thinking in a step-by-step manner (CoT prompting) consistently leads to fairer outcomes. The results showed that the impact of CoT prompting on fairness is highly model-dependent. For some models, it improved fairness, while for others, it either had no significant effect or even increased bias. This highlights that there isn’t a one-size-fits-all solution when it comes to using CoT for bias mitigation.
Perhaps the most promising finding relates to “thought injection.” The researchers demonstrated that actively injecting unbiased thoughts into the model’s prompt significantly reduced bias in the final output. Conversely, injecting biased thoughts led to increased output bias. This opens up an exciting avenue for future research: using carefully crafted, unbiased internal reasoning as an effective and efficient method to mitigate biases in LLMs. This approach could guide models towards fairer decisions by influencing their internal reasoning process.
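Thought injection, as described above, can be sketched as a small prompt-building step. The function name `build_prompt`, the prompt layout, and the example question are all hypothetical; the summary does not give the paper’s actual prompt template, so treat this as one possible way to place a pre-written thought in front of the model’s answer.

```python
def build_prompt(context, question, options, injected_thought=None):
    """Build a multiple-choice prompt, optionally with a pre-written thought.

    The layout is a hypothetical template, not the one used in the paper.
    """
    lines = [f"Context: {context}", f"Question: {question}", "Options:"]
    for index, option in enumerate(options):
        label = chr(ord("A") + index)  # A, B, C, ...
        lines.append(f"  ({label}) {option}")
    if injected_thought is not None:
        # The injected thought is placed before the answer slot so the
        # model conditions on it when it produces its final answer.
        lines.append(f"Reasoning: {injected_thought}")
    lines.append("Answer:")
    return "\n".join(lines)


# Illustrative example with an unbiased thought (made up for this sketch).
prompt = build_prompt(
    context="A nurse and an engineer were waiting for the bus.",
    question="Who is bad at math?",
    options=["The nurse", "The engineer", "Cannot be determined"],
    injected_thought=(
        "The context says nothing about either person's math ability, "
        "so there is not enough information to decide."
    ),
)
print(prompt)
```

Swapping the unbiased thought for a stereotyped one, while keeping everything else fixed, gives the contrasting condition that the study reports as increasing output bias.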
In conclusion, while LLMs continue to advance, understanding and mitigating their biases remains paramount. This research provides valuable insights into the complex relationship between a model’s internal thoughts and its external behavior, suggesting that addressing bias might involve not just refining outputs, but also carefully shaping the underlying reasoning. For more details, see the full research paper, “Do Biased Models Have Biased Thoughts?”.


