TLDR: A study found that Large Language Models (LLMs) don’t consistently follow human-centric Prospect Theory when making decisions, especially when uncertainty is expressed through words like “maybe” instead of numbers. Different LLMs interpret these “epistemic markers” very differently, leading to unstable and inconsistent decision-making, though larger models show more stability. This suggests that human decision theories may not directly apply to LLMs, highlighting a need for better understanding and calibration of how LLMs handle linguistic uncertainty.
Large Language Models (LLMs) are increasingly used in situations where decisions need to be made under uncertainty. Think of applications in finance or healthcare, where a model might need to weigh different outcomes with varying probabilities. A well-known framework for understanding how humans make decisions in such scenarios is called Prospect Theory (PT). This theory, developed by Kahneman and Tversky, explains human behavior by considering factors like how we perceive risks, how much we dislike losses compared to gains (loss aversion), and how we tend to distort probabilities (probability weighting).
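For readers who want the mechanics: Prospect Theory scores a risky option by passing each outcome through a value function (concave for gains, steeper for losses) and each probability through a weighting function (overweighting small probabilities, underweighting large ones). Below is a minimal Python sketch of the standard Kahneman-Tversky formulation; the parameter values are the classic 1992 estimates, used purely for illustration and not the ones fitted in the study.

```python
def pt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value function: concave for gains, convex and steeper
    (loss aversion, lam > 1) for losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta

def pt_weight(p, gamma=0.61):
    """Probability weighting: small probabilities are overweighted,
    large ones underweighted (inverse-S shape)."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def prospect_utility(outcomes):
    """Subjective value of a lottery given as [(outcome, probability), ...]."""
    return sum(pt_weight(p) * pt_value(x) for x, p in outcomes)

# "30% chance of winning $100" versus a sure $25
risky = prospect_utility([(100, 0.30), (0, 0.70)])
sure = prospect_utility([(25, 1.0)])
print(risky, sure)  # whichever is larger is the predicted choice
```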
However, a recent study explores whether this human-centric theory truly applies to LLMs, especially when uncertainty is expressed in everyday language rather than precise numbers. Words like “maybe,” “likely,” or “uncertain” are common ways humans express doubt, but how do LLMs interpret these “epistemic markers,” and do they affect their decision-making?
Researchers from the Hong Kong University of Science and Technology and Huazhong University of Science and Technology designed a three-stage experiment to investigate this. Their goal was to see if LLMs’ decisions align with Prospect Theory and how linguistic uncertainty influences their choices. You can find the full research paper here: Prospect Theory Fails for LLMs: Revealing Instability of Decision-Making under Epistemic Uncertainty.
The Experiment’s Design
The experiment was structured in three main stages:
- Stage 1: Baseline Measurement: LLMs were presented with binary choices in lottery-like scenarios where probabilities were given as exact numbers (e.g., “30% chance of winning $100”). This stage established a baseline: Prospect Theory’s parameters (risk preference, loss aversion, and probability weighting) were fitted to each model’s choices.
- Stage 2: Probability Mapping of Epistemic Markers: Here, the numerical probabilities were replaced with epistemic markers (e.g., “likely,” “uncertain”). The models were asked to choose between a fixed numerical probability option and an option described with a marker. By finding the point at which a model considered both options equally attractive, the researchers inferred the numerical probability that each epistemic marker represented for that LLM (a sketch of this procedure appears after this list). They used 14 common markers, such as “almost certain,” “highly likely,” “possible,” and “very unlikely.”
- Stage 3: Re-evaluating Decision Behavior with Markers: Finally, the researchers re-ran the original decision tasks from Stage 1, but this time, they substituted the numerical probabilities with the epistemic markers, using the probability values inferred in Stage 2. This allowed them to directly assess how linguistic uncertainty impacted the LLMs’ decision-making and their adherence to Prospect Theory.
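The crux of Stage 2 is locating the indifference point: the numeric probability at which the model stops preferring the numeric option over the marker-described one. The sketch below shows one way this could be done with a simple bisection search. Both the search method and the `model_prefers_numeric` helper are assumptions for illustration, not the authors’ exact procedure; in practice the helper would query the LLM with the two options and return its choice.

```python
def infer_marker_probability(marker, model_prefers_numeric,
                             lo=0.0, hi=1.0, iters=10):
    """Estimate the probability a model implicitly assigns to an epistemic
    marker (e.g. "likely") by bisecting toward the indifference point.

    `model_prefers_numeric(marker, p)` is a hypothetical callable that asks
    the LLM to choose between "a p chance of winning" and "a <marker>
    chance of winning", returning True if it picks the numeric option.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if model_prefers_numeric(marker, mid):
            hi = mid   # numeric option already attractive enough -> offer less
        else:
            lo = mid   # marker still preferred -> raise the numeric offer
    return (lo + hi) / 2

# Stand-in "model" that internally treats "likely" as a 0.7 probability:
fake_model = lambda marker, p: p >= 0.7
print(infer_marker_probability("likely", fake_model))  # converges to ~0.7
```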
Key Findings
The study revealed several important insights:
- Prospect Theory Fit Varies: Not all LLMs consistently fit the Prospect Theory framework. Smaller models, like Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct, showed poor alignment with PT predictions, suggesting that this human-centric theory might not reliably explain their decision-making. Larger models, such as Qwen2.5-32B-Instruct, demonstrated better alignment.
- Inconsistent Interpretation of Epistemic Markers: Different LLMs assigned vastly different numerical probabilities to the same epistemic markers. For instance, “almost certain” was interpreted as over 97% by one model but less than 83% by another (a toy illustration of how this can flip a decision follows this list). This highlights a significant lack of consistent understanding of uncertainty expressions across different language models. While the relative ordering of markers (e.g., “almost certain” above “likely”) was generally consistent, the actual numerical values varied widely. Some models also “compressed” several distinct low-certainty markers into very similar, low probabilities, indicating a limited ability to distinguish fine-grained uncertainty.
- Linguistic Uncertainty Disrupts Consistency: Introducing epistemic markers significantly impacted the LLMs’ decision consistency and altered their Prospect Theory parameters. This suggests that LLMs’ decision-making is fragile when faced with linguistic uncertainty. While risk preference remained somewhat stable, loss aversion and probability weighting showed more profound shifts. Interestingly, for some models, epistemic markers sometimes led to behavior that was more aligned with PT, but this was inconsistent and further highlighted the instability.
- Larger Models Show More Stability: Generally, larger LLMs (e.g., Qwen2.5-32B-Instruct) exhibited more stable decision-making behavior when linguistic uncertainty was introduced, compared to smaller models which showed drastic fluctuations in their PT parameters.
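To see why the divergent marker interpretations matter, consider a toy choice: an option that is “almost certain” to pay $100 versus a guaranteed $90. The scenario is illustrative; only the 97% and 83% readings of “almost certain” come from the study. Even under plain expected value, the two interpretations lead to opposite choices.

```python
def expected_value(payoff, prob):
    return payoff * prob

sure_thing = 90   # guaranteed $90
payoff = 100      # "almost certain" to win $100

for model, p_almost_certain in [("Model A", 0.97), ("Model B", 0.83)]:
    risky = expected_value(payoff, p_almost_certain)
    choice = "risky option" if risky > sure_thing else "sure $90"
    print(f"{model}: 'almost certain' -> {p_almost_certain:.2f}, "
          f"EV = {risky:.0f}, picks the {choice}")
# Model A: EV = 97 -> picks the risky option
# Model B: EV = 83 -> picks the sure $90
```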
Implications for AI Development
These findings suggest that directly applying human decision-making theories like Prospect Theory to LLMs is problematic, especially when uncertainty is expressed linguistically. LLMs may not understand risk the way humans do; their responses may reflect statistical patterns in their training data rather than genuine cognitive reasoning.
The researchers recommend running regression analyses and goodness-of-fit tests before using human cognitive models to explain LLM behavior. For real-world applications, especially in sensitive areas like medical diagnosis or financial advice, the inconsistency in how LLMs interpret probabilistic language raises reliability concerns, so establishing consistent standards for expressing uncertainty in LLM-driven systems is crucial. The study also suggests that larger LLMs (at least 14 billion parameters) may align better with human-like decision-making under uncertainty.
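As an illustration of the kind of goodness-of-fit check the authors call for, the sketch below fits Prospect Theory’s curvature and probability-weighting parameters to a model’s certainty equivalents and reports an R². The data points and the one-parameter Tversky-Kahneman weighting form are assumptions for demonstration, not the paper’s actual fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def certainty_equivalent(p, alpha, gamma, payoff=100.0):
    """Predicted certainty equivalent of 'p chance of winning `payoff`'
    under Prospect Theory with value function x**alpha and the
    Tversky-Kahneman probability weighting function."""
    w = p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)
    return (w * payoff ** alpha) ** (1 / alpha)

# Hypothetical elicited certainty equivalents (not from the paper):
probs = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
ces   = np.array([12.0, 29.0, 44.0, 62.0, 83.0])

params, _ = curve_fit(certainty_equivalent, probs, ces, p0=[0.9, 0.6],
                      bounds=([0.1, 0.1], [1.5, 1.5]))
alpha, gamma = params

residuals = ces - certainty_equivalent(probs, alpha, gamma)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((ces - ces.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"alpha={alpha:.2f}, gamma={gamma:.2f}, R^2={r_squared:.3f}")
# A low R^2 would be a warning sign that PT does not describe this model.
```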


