TLDR: A new study evaluates and improves the confidence reliability of Large Language Models (LLMs) in code reasoning tasks. It finds that models with explicit reasoning capabilities, like DeepSeek-Reasoner, exhibit superior self-assessment. While prompt optimization offers limited gains, mathematical calibration (e.g., Platt Scaling) significantly and consistently enhances confidence reliability across various LLMs and tasks, making their self-reported confidence more aligned with actual correctness. The research highlights the importance of reliable confidence for efficient software development and points to future directions for optimizing LLM trustworthiness.
Large Language Models (LLMs) are rapidly transforming the field of code intelligence, assisting developers with tasks like code review, debugging, and testing. As their use becomes more widespread, understanding the reliability and trustworthiness of their outputs in code reasoning tasks is paramount. This is where ‘confidence’ comes into play – an LLM’s own assessment of how likely its answer is to be correct.
A recent research paper, ‘Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs’, delves into this critical aspect, proposing a framework to analyze and enhance the confidence reliability of LLMs specifically for code reasoning. Authored by Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, and Xin Xia, the study offers a comprehensive look at how well mainstream LLMs gauge their own accuracy and explores methods to improve this self-assessment.
Why Confidence Matters in Code Reasoning
Imagine an LLM suggesting a fix for a bug or predicting a program’s output. Developers need to know how much they can trust that suggestion. If an LLM provides an answer with high confidence, developers might spend less time verifying it. Conversely, low confidence would prompt more thorough checks or even a regeneration of the answer. This ‘self-awareness’ lets developers concentrate their verification effort where the model is least sure, which can streamline software development and make human-AI collaboration more efficient and reliable.
However, current LLMs aren’t perfect. They can sometimes be confidently wrong, or provide uncertain outputs for complex logic. The goal of this research is to make LLM confidence truly ‘reliable’ – meaning the model’s stated confidence consistently aligns with the actual correctness of its answers across various tasks.
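To make ‘reliable’ concrete, here is a minimal sketch of one widely used calibration metric, Expected Calibration Error (ECE). The summary above does not say which metrics the study uses, so the choice of ECE, the function name, and the ten-bin default are all assumptions made for illustration. The idea is to group answers by stated confidence and compare each group's average confidence with its actual accuracy.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's self-reported confidence.
    correct:     0/1 flags, 1 if the corresponding answer was right.
    Illustrative helper, not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each sample to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        # Each bin contributes its gap, weighted by its share of the samples.
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that says 0.9 on four answers but gets only three right has a gap of 0.15 in that bin; a perfectly calibrated model scores 0.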
The Research Approach: A Three-Pronged Strategy
The researchers adopted a systematic approach involving three main steps:
- Empirical Study: First, they evaluated the ‘intrinsic’ confidence of various LLMs by asking them to provide both an answer and a confidence score for code reasoning questions. This established a baseline for how reliable their confidence was initially.
- Prompt Strategy Optimization: Next, they experimented with different ways of prompting the LLMs to generate confidence. Two strategies were explored: a ‘reassess’ strategy, where the LLM was asked to re-evaluate its confidence assuming its initial answer might be wrong (a form of self-doubt), and a ‘reflective’ strategy, where a separate LLM acted as an independent evaluator of the main LLM’s answer.
- Mathematical Calibration: Finally, they applied mathematical techniques, specifically Platt Scaling, to adjust the raw confidence scores. This method aims to align the LLM’s predicted probabilities more closely with the actual correctness of its answers.
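The third step can be sketched in a few lines. Platt Scaling fits a two-parameter logistic curve, p = sigmoid(a * s + b), on held-out pairs of raw confidence s and correctness label, then applies that curve to new confidence scores. The version below is a sketch under stated assumptions rather than the authors' implementation: it fits on the raw confidence value (some setups fit on logits instead), uses plain gradient descent on the log loss, and the names `fit_platt` and `apply_platt`, the learning rate, and the step count are all hypothetical.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    scores: raw confidences from the model, on a held-out calibration set.
    labels: 0/1 correctness of the corresponding answers.
    Returns the fitted (a, b). Hyperparameters are illustrative defaults.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score, a, b):
    """Map a raw confidence to a calibrated probability."""
    return _sigmoid(a * score + b)
```

In use, the parameters would be fitted once on a calibration split and then applied to every new confidence the model reports. For example, if answers reported at 0.9 confidence are right only about 60% of the time in the calibration set, the fitted curve maps 0.9 to roughly 0.6.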
Key Findings: DeepSeek-Reasoner Leads, Calibration is Key
The study yielded several significant insights into LLM confidence:
- Reasoning Capabilities are Crucial: Models with explicit reasoning capabilities, such as DeepSeek-Reasoner, consistently demonstrated the best confidence reliability across various tasks. They were better at accurately assessing their own correctness compared to models without such explicit reasoning.
- Open-Source vs. Closed-Source: Interestingly, mainstream open-source LLMs generally outperformed closed-source models like GPT-3.5 Turbo in terms of confidence reliability for code reasoning tasks. GPT-3.5 Turbo often exhibited highly unreliable confidence, sometimes performing worse than random guessing.
- Model Scale and Task Complexity: While larger models in the Qwen3 series showed slight improvements in confidence reliability for some tasks, this improvement wasn’t universal and sometimes plateaued. Task complexity also played a major role; confidence was generally less reliable for more intricate intermediate state reasoning tasks (like Program State Prediction and Execution Path Prediction) compared to simpler ones (like Code Coverage Prediction and Output Prediction).
- Prompt Strategies Offer Limited, Task-Specific Gains: Prompt optimization methods, like asking the model to reassess its confidence, showed some potential for improvement, particularly for high-performing models and specific tasks. However, their effectiveness was not universal and depended heavily on the model’s inherent reasoning ability and the complexity of the task. For lower-performing models, these strategies had minimal impact.
- Mathematical Calibration is a Game Changer: The most impactful finding was the significant and consistent improvement brought by mathematical calibration methods like Platt Scaling. This technique systematically enhanced confidence reliability across all models, tasks, and confidence generation methods. It effectively reduced bias and made the LLM’s confidence predictions much more aligned with reality, often turning previously unreliable (negative) performance scores into positive ones. This suggests that mathematical calibration is a powerful tool for making LLM confidence trustworthy.
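The summary does not define the ‘performance score’ that can go negative. One metric that behaves this way is the Brier Skill Score, which compares a model's Brier score with that of a trivial baseline that always predicts the overall accuracy; the sketch below assumes that metric purely for illustration, and the function names are hypothetical. A score below zero means the stated confidences are less informative than the baseline, and a score above zero means they add information.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 outcome (lower is better)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def brier_skill_score(confidences, correct):
    """1 - BS / BS_ref, where BS_ref always predicts the observed base rate.

    Negative: worse than the constant baseline. Zero: on par. Positive: better.
    """
    base_rate = sum(correct) / len(correct)
    reference = brier_score([base_rate] * len(correct), correct)
    if reference == 0.0:
        raise ValueError("baseline is perfect (all answers share one outcome)")
    return 1.0 - brier_score(confidences, correct) / reference
```

With this definition, a model that reports 0.9 on every answer while being right only half the time scores below zero, whereas confidences that track correctness score close to one.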
Balancing Reliability and Practical Utility
While mathematical calibration proved highly effective, the researchers also highlighted a limitation: it can sometimes narrow the range of confidence distributions, making it harder for developers to distinguish between different levels of risk. If all confidence scores cluster too closely, the model’s ability to discriminate between truly high-confidence and low-confidence outputs might decrease, impacting its practical utility for nuanced decision-making in critical scenarios.
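The narrowing effect is easy to reproduce with a toy example. The sketch below applies a Platt-style transform with hypothetical parameters (the values of `A` and `B` are invented for illustration, not fitted on real data or taken from the paper) to a handful of raw confidences and compares the spread before and after.

```python
import math


def platt_transform(score, a, b):
    """Logistic mapping used by Platt Scaling: sigmoid(a * score + b)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def spread(values):
    """Width of the range covered by a list of confidences."""
    return max(values) - min(values)


# Raw self-reported confidences spanning a wide range.
raw = [0.50, 0.70, 0.90, 0.99]

# Hypothetical fitted parameters; a gentle slope compresses the outputs.
A, B = 1.0, -0.2
calibrated = [platt_transform(s, A, B) for s in raw]

print("raw spread:       ", round(spread(raw), 3))
print("calibrated spread:", round(spread(calibrated), 3))
```

Here the raw scores span 0.50 to 0.99, while the calibrated ones fall between roughly 0.57 and 0.69. Because the transform is monotonic when the slope is positive, the ranking of answers is preserved, but the gaps between them shrink, which is what makes threshold-based decisions harder.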
Looking Ahead
This research provides a foundational understanding and technical reference for applying confidence in LLM-assisted software engineering. It underscores that while current LLMs show promise, especially with reasoning capabilities and mathematical calibration, there’s still room for improvement, particularly in maintaining the interpretability and discriminative power of confidence scores for complex tasks. Future work will likely focus on dynamic calibration strategies and integrating confidence with active learning and risk control to further enhance the trustworthiness and practical application of LLMs in real-world development.


