TLDR: A new research paper explores how supervised finetuning can significantly improve the ability of large language models (LLMs) to communicate their uncertainty. The study found that single-task training improves the specific metacognitive skill it targets (single-question confidence or pairwise comparison), but these improvements do not readily transfer between tasks. Multitask finetuning, which trains models on both types of uncertainty communication at once, leads to broader and more generalizable gains in calibration (stated confidence matching accuracy) and discrimination (distinguishing correct from incorrect answers) across various knowledge domains, while overall accuracy stays largely stable. This work highlights the importance of diverse training for developing more reliable and transparent AI systems.
Large language models (LLMs) are becoming increasingly integrated into critical decision-making processes across various fields, from education and business to law and medicine. While these powerful AIs can generate impressive responses, a significant challenge remains: they often present information with high confidence, even when it’s incorrect. This can lead users to unknowingly act on erroneous outputs, with potentially serious consequences. Imagine an AI giving medical advice without indicating it’s unsure, or a legal brief with a confident but flawed argument. This is where the concept of ‘metacognition’ for LLMs comes into play – the ability of an AI to monitor its own knowledge and reasoning processes, essentially knowing what it knows and, more importantly, what it doesn’t.
A recent research paper, “Improving Metacognition and Uncertainty Communication in Language Models”, delves into this crucial area. Authored by Mark Steyvers, Catarina Belem, and Padhraic Smyth from the University of California, Irvine, this study investigates whether specialized training, known as supervised finetuning, can enhance an LLM’s capacity to communicate its uncertainty effectively. The researchers also explored whether these improvements could extend to new tasks and unfamiliar domains.
Understanding AI’s Self-Assessment
The core of the research revolved around two distinct metacognitive tasks designed to evaluate how LLMs express confidence. The first was single-question confidence estimation, where the model provides a numerical confidence score (e.g., 0.75) alongside its answer to a single question. To measure performance here, two metrics were key: calibration, which assesses how well the model’s stated confidence aligns with its actual accuracy (e.g., if it says 90% confident, it should be correct 90% of the time), and discrimination, which measures its ability to assign higher confidence to correct answers compared to incorrect ones.
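To make the two metrics concrete, here is a minimal sketch in plain Python. It uses expected calibration error (ECE) for calibration and the area under the ROC curve (AUC) for discrimination. Both are common choices for these properties, but this summary does not say which exact formulas the paper uses, so treat the function names, the bin count, and the metric choices as assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(conf)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def discrimination_auc(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Chance that a correct answer gets higher confidence than an incorrect
    one, counting ties as half a win."""
    pos = [c for c, y in zip(conf, correct) if y]
    neg = [c for c, y in zip(conf, correct) if not y]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: four answers with stated confidence and whether each was right.
confidences = [0.9, 0.9, 0.6, 0.6]
was_correct = [True, True, True, False]
print(expected_calibration_error(confidences, was_correct))
print(discrimination_auc(confidences, was_correct))
```

A perfectly calibrated model has an ECE of 0, and an AUC of 0.5 means the confidence scores carry no information about which answers are right.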
The second task was pairwise confidence comparison. In this scenario, the model was presented with two questions and asked to identify which one it was more likely to answer correctly. This task provides a way to assess discrimination without requiring a numerical score, similar to how humans might make relative judgments about their knowledge.
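One way to score the pairwise task is sketched below. It assumes each judgment records which question the model picked ("A" or "B") and whether the model's actual answer to each question was correct. Pairs where both answers are right or both are wrong carry no signal and are skipped. This scoring rule and the tuple layout are illustrative assumptions, not the paper's documented protocol.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(judgments: Iterable[Tuple[str, bool, bool]]) -> float:
    """Fraction of informative pairs where the model picked the question
    it actually answered correctly.

    Each judgment is (choice, correct_a, correct_b), with choice "A" or "B".
    """
    informative = 0
    hits = 0
    for choice, correct_a, correct_b in judgments:
        if correct_a == correct_b:
            continue  # no ground-truth preference for this pair
        informative += 1
        if (choice == "A") == correct_a:
            hits += 1
    if informative == 0:
        raise ValueError("no pair with exactly one correct answer")
    return hits / informative


judgments = [
    ("A", True, False),   # picked the question it got right
    ("B", True, False),   # picked the question it got wrong
    ("B", False, True),   # picked the question it got right
    ("A", True, True),    # skipped: both answers correct
]
print(pairwise_accuracy(judgments))
```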
The Training Approach and Key Findings
To improve the LLMs’ uncertainty communication, the researchers employed supervised finetuning. They finetuned two LLMs, GPT-4.1 mini and Llama3.1 70B, using datasets covering general knowledge, mathematics, and open-ended trivia. The training targets were ‘consistency-based uncertainty signals’: for each question, the researchers sampled multiple responses and measured how consistent the answers were. This consistency served as a proxy for confidence, which was then used to train the models to verbalize more accurate confidence scores.
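The consistency idea can be sketched in a few lines. The version below takes a list of already-sampled answers to one question and returns the most frequent answer along with the share of samples that agree with it. Exact matching after lowercasing and whitespace cleanup is an assumption made for illustration; the summary does not describe how the paper compares answers, and open-ended responses would likely need a looser equivalence check.

```python
from collections import Counter
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(answer.lower().split())


def consistency_confidence(samples: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)


# Five hypothetical samples for "What is the capital of France?"
samples = ["Paris", "paris", " Paris ", "Lyon", "Paris"]
print(consistency_confidence(samples))
```

The resulting fraction can then serve as the confidence value the model is trained to state for that question.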
The results were insightful:
- Improved Confidence Within and Across Domains: Finetuning significantly improved both calibration and discrimination for single-question confidence. This was true not only for questions within the domains the models were trained on but also for entirely new, unseen domains like medical and legal reasoning. This suggests that the ability to communicate uncertainty can generalize to unfamiliar content. Importantly, these improvements in confidence communication did not come at the cost of overall accuracy, which remained largely stable.
- Task-Specific vs. Generalizable Skills: A crucial finding was that improvements were often task-specific. Training an LLM solely on single-question confidence estimation did not automatically make it better at pairwise comparisons, and vice versa. This indicates that these different metacognitive skills are learned as distinct routines.
- The Power of Multitask Training: The picture changed dramatically with multitask finetuning. When models were trained jointly on both single-question confidence estimation and pairwise comparison tasks, they showed broader and more consistent improvements. This combined training led to better calibration and discrimination across tasks and domains, suggesting that exposing models to diverse forms of confidence reporting encourages the development of more shared, generalizable internal representations of uncertainty.
- LLM Differences: While both GPT-4.1 mini and Llama3.1 70B showed similar overall trends, Llama3.1 70B did not exhibit the same gains in the comparison task under multitask training, indicating that the effectiveness of multitask training can vary from model to model.
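As a rough illustration of what "training jointly on both tasks" could look like at the data level, the sketch below turns a list of questions with confidence targets into a shuffled mix of single-question and pairwise examples. The record fields, prompt layout, and the rule of skipping tied pairs are all hypothetical; the summary does not describe the paper's actual prompt or target formats.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def build_multitask_examples(
    items: List[Tuple[str, float]], seed: int = 0
) -> List[Dict[str, str]]:
    """Mix single-question and pairwise training records.

    items holds (question, confidence) pairs with confidence in [0, 1].
    """
    # Single-question task: the target is the confidence as text.
    single = [
        {"task": "single", "prompt": q, "target": f"{c:.2f}"} for q, c in items
    ]
    # Pairwise task: the target names the question with higher confidence.
    pairwise = []
    for (qa, ca), (qb, cb) in combinations(items, 2):
        if ca == cb:
            continue  # tied pairs give no preference label
        pairwise.append(
            {
                "task": "pairwise",
                "prompt": f"A: {qa}\nB: {qb}",
                "target": "A" if ca > cb else "B",
            }
        )
    mixed = single + pairwise
    random.Random(seed).shuffle(mixed)  # interleave the two tasks
    return mixed


items = [("q1", 0.9), ("q2", 0.4), ("q3", 0.4)]
for record in build_multitask_examples(items):
    print(record)
```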
Implications for Safer AI Deployment
This research offers valuable insights into making LLMs more reliable and transparent. The finding that uncertainty communication is trainable and generalizable, especially through multitask and multidomain training, is a significant step towards safer AI deployment. When LLMs are taught to better assess and communicate their own confidence, users can make more informed decisions, reducing the risks associated with acting on potentially incorrect AI outputs. The parallels drawn with human metacognition also suggest that, much like humans, LLMs might develop a hybrid architecture for self-assessment, combining both general and specialized components for monitoring their knowledge.


