TLDR: A study found that Large Language Models (LLMs) don’t consistently follow human-centric Prospect Theory when making decisions, especially when uncertainty is expressed through words like “maybe” instead of numbers. Different LLMs interpret these “epistemic markers” very differently, leading to unstable and inconsistent decision-making, though larger models show more stability. This suggests that human decision theories may not directly apply to LLMs, highlighting a need for better understanding and calibration of how LLMs handle linguistic uncertainty.
Large Language Models (LLMs) are increasingly used in situations where decisions need to be made under uncertainty. Think of applications in finance or healthcare, where a model might need to weigh different outcomes with varying probabilities. A well-known framework for understanding how humans make decisions in such scenarios is called Prospect Theory (PT). This theory, developed by Kahneman and Tversky, explains human behavior by considering factors like how we perceive risks, how much we dislike losses compared to gains (loss aversion), and how we tend to distort probabilities (probability weighting).
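For readers who want the mechanics: Prospect Theory scores a risky option by passing each outcome through a value function (concave for gains, steeper for losses) and each probability through a weighting function (overweighting small probabilities, underweighting large ones). Below is a minimal Python sketch of the standard Kahneman-Tversky formulation; the parameter values are the classic 1992 estimates, used purely for illustration and not the ones fitted in the study.

```python
def pt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value function: concave for gains, convex and steeper
    (loss aversion, lam > 1) for losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta

def pt_weight(p, gamma=0.61):
    """Probability weighting: small probabilities are overweighted,
    large ones underweighted (inverse-S shape)."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def prospect_utility(outcomes):
    """Subjective value of a lottery given as [(outcome, probability), ...]."""
    return sum(pt_weight(p) * pt_value(x) for x, p in outcomes)

# "30% chance of winning $100" versus a sure $25
risky = prospect_utility([(100, 0.30), (0, 0.70)])
sure = prospect_utility([(25, 1.0)])
print(risky, sure)  # whichever is larger is the predicted choice
```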
However, a recent study explores whether this human-centric theory truly applies to LLMs, especially when uncertainty is expressed in everyday language rather than precise numbers. Words like “maybe,” “likely,” or “uncertain” are common ways humans express doubt, but how do LLMs interpret these “epistemic markers,” and do they affect their decision-making?
Researchers from the Hong Kong University of Science and Technology and Huazhong University of Science and Technology designed a three-stage experiment to investigate this. Their goal was to see if LLMs’ decisions align with Prospect Theory and how linguistic uncertainty influences their choices. You can find the full research paper here: Prospect Theory Fails for LLMs: Revealing Instability of Decision-Making under Epistemic Uncertainty.
The Experiment’s Design
The experiment was structured in three main stages:
- Stage 1: Baseline Measurement: LLMs were presented with binary choices in lottery-like scenarios where probabilities were given as exact numbers (e.g., “30% chance of winning $100”). This stage established a baseline: Prospect Theory’s parameters (risk preference, loss aversion, and probability weighting) were fitted to each model’s choices.
- Stage 2: Probability Mapping of Epistemic Markers: Here, the numerical probabilities were replaced with epistemic markers (e.g., “likely,” “uncertain”). The models were asked to choose between a fixed numerical probability option and an option described with a marker. By finding the point at which a model considered both options equally attractive, the researchers inferred the numerical probability that each epistemic marker represented for that LLM (a sketch of this procedure appears after this list). They used 14 common markers, such as “almost certain,” “highly likely,” “possible,” and “very unlikely.”
- Stage 3: Re-evaluating Decision Behavior with Markers: Finally, the researchers re-ran the original decision tasks from Stage 1, but this time, they substituted the numerical probabilities with the epistemic markers, using the probability values inferred in Stage 2. This allowed them to directly assess how linguistic uncertainty impacted the LLMs’ decision-making and their adherence to Prospect Theory.
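The crux of Stage 2 is locating the indifference point: the numeric probability at which the model stops preferring the numeric option over the marker-described one. The sketch below shows one way this could be done with a simple bisection search. Both the search method and the `model_prefers_numeric` helper are assumptions for illustration, not the authors’ exact procedure; in practice the helper would query the LLM with the two options and return its choice.

```python
def infer_marker_probability(marker, model_prefers_numeric,
                             lo=0.0, hi=1.0, iters=10):
    """Estimate the probability a model implicitly assigns to an epistemic
    marker (e.g. "likely") by bisecting toward the indifference point.

    `model_prefers_numeric(marker, p)` is a hypothetical callable that asks
    the LLM to choose between "a p chance of winning" and "a <marker>
    chance of winning", returning True if it picks the numeric option.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if model_prefers_numeric(marker, mid):
            hi = mid   # numeric option already attractive enough -> offer less
        else:
            lo = mid   # marker still preferred -> raise the numeric offer
    return (lo + hi) / 2

# Stand-in "model" that internally treats "likely" as a 0.7 probability:
fake_model = lambda marker, p: p >= 0.7
print(infer_marker_probability("likely", fake_model))  # converges to ~0.7
```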
Key Findings
The study revealed several important insights:
- Prospect Theory Fit Varies: Not all LLMs consistently fit the Prospect Theory framework. Smaller models, like Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct, showed poor alignment with PT predictions, suggesting that this human-centric theory might not reliably explain their decision-making. Larger models, such as Qwen2.5-32B-Instruct, demonstrated better alignment.
- Inconsistent Interpretation of Epistemic Markers: Different LLMs assigned vastly different numerical probabilities to the same epistemic markers. For instance, “almost certain” was interpreted as over 97% by one model but less than 83% by another (a toy illustration of how this can flip a decision follows this list). This highlights a significant lack of consistent understanding of uncertainty expressions across different language models. While the relative ordering of markers (e.g., “almost certain” above “likely”) was generally consistent, the actual numerical values varied widely. Some models also “compressed” several distinct low-certainty markers into very similar, low probabilities, indicating a limited ability to distinguish fine-grained uncertainty.
- Linguistic Uncertainty Disrupts Consistency: Introducing epistemic markers significantly impacted the LLMs’ decision consistency and altered their Prospect Theory parameters. This suggests that LLMs’ decision-making is fragile when faced with linguistic uncertainty. While risk preference remained somewhat stable, loss aversion and probability weighting showed more profound shifts. Interestingly, for some models, epistemic markers sometimes led to behavior that was more aligned with PT, but this was inconsistent and further highlighted the instability.
- Larger Models Show More Stability: Generally, larger LLMs (e.g., Qwen2.5-32B-Instruct) exhibited more stable decision-making behavior when linguistic uncertainty was introduced, compared to smaller models which showed drastic fluctuations in their PT parameters.
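To see why the divergent marker interpretations matter, consider a toy choice: an option that is “almost certain” to pay $100 versus a guaranteed $90. The scenario is illustrative; only the 97% and 83% readings of “almost certain” come from the study. Even under plain expected value, the two interpretations lead to opposite choices.

```python
def expected_value(payoff, prob):
    return payoff * prob

sure_thing = 90   # guaranteed $90
payoff = 100      # "almost certain" to win $100

for model, p_almost_certain in [("Model A", 0.97), ("Model B", 0.83)]:
    risky = expected_value(payoff, p_almost_certain)
    choice = "risky option" if risky > sure_thing else "sure $90"
    print(f"{model}: 'almost certain' -> {p_almost_certain:.2f}, "
          f"EV = {risky:.0f}, picks the {choice}")
# Model A: EV = 97 -> picks the risky option
# Model B: EV = 83 -> picks the sure $90
```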
Implications for AI Development
These findings suggest that directly applying human decision-making theories like Prospect Theory to LLMs is problematic, especially when uncertainty is expressed linguistically. LLMs may not understand risk the way humans do; their responses may reflect statistical patterns in their training data rather than genuine cognitive reasoning.
The researchers recommend running regression analyses and goodness-of-fit tests before using human cognitive models to explain LLM behavior. For real-world applications, especially in sensitive areas like medical diagnosis or financial advice, the inconsistency in how LLMs interpret probabilistic language raises reliability concerns, so establishing consistent standards for expressing uncertainty in LLM-driven systems is crucial. The study also suggests that larger LLMs (at least 14 billion parameters) may align better with human-like decision-making under uncertainty.
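As an illustration of the kind of goodness-of-fit check the authors call for, the sketch below fits Prospect Theory’s curvature and probability-weighting parameters to a model’s certainty equivalents and reports an R². The data points and the one-parameter Tversky-Kahneman weighting form are assumptions for demonstration, not the paper’s actual fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def certainty_equivalent(p, alpha, gamma, payoff=100.0):
    """Predicted certainty equivalent of 'p chance of winning `payoff`'
    under Prospect Theory with value function x**alpha and the
    Tversky-Kahneman probability weighting function."""
    w = p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)
    return (w * payoff ** alpha) ** (1 / alpha)

# Hypothetical elicited certainty equivalents (not from the paper):
probs = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
ces   = np.array([12.0, 29.0, 44.0, 62.0, 83.0])

params, _ = curve_fit(certainty_equivalent, probs, ces, p0=[0.9, 0.6],
                      bounds=([0.1, 0.1], [1.5, 1.5]))
alpha, gamma = params

residuals = ces - certainty_equivalent(probs, alpha, gamma)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((ces - ces.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"alpha={alpha:.2f}, gamma={gamma:.2f}, R^2={r_squared:.3f}")
# A low R^2 would be a warning sign that PT does not describe this model.
```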


