TLDR: A new study reveals that Large Language Models (LLMs) are systematically overconfident across multiple languages (English, French, German, Japanese, Mandarin), and human users consistently overrely on these confident outputs. Despite LLMs adapting their linguistic expressions of certainty to cultural norms, human reliance behaviors differ cross-linguistically, sometimes increasing the risk of overreliance even when models use more uncertainty markers. This highlights significant global safety concerns and the critical need for culturally and linguistically sensitive AI safety evaluations.
Large Language Models (LLMs) are becoming increasingly prevalent worldwide, making it crucial to understand how they convey uncertainty and how users interpret their responses across different languages. Previous research has highlighted that English LLMs often exhibit overconfidence, leading users to place too much trust in their confident outputs. However, the way people use and understand expressions of certainty or uncertainty (known as epistemic markers, like ‘It’s definitely’ or ‘I think’) can vary significantly across languages.
A recent study delves into the risks of multilingual linguistic miscalibration, overconfidence, and overreliance across five languages: English, French, German, Japanese, and Mandarin. The findings reveal that the risk of overreliance on LLMs is high across all these languages.
LLM Overconfidence Across Languages
The researchers first analyzed the distribution of epistemic markers generated by LLMs. They observed that while LLMs are indeed overconfident across different languages, they also show sensitivity to documented linguistic variations. For instance, models tend to generate the most markers of uncertainty in Japanese, while producing the most markers of certainty in German and Mandarin. This suggests that LLMs adapt their linguistic style to some extent based on the language.
Despite this linguistic adaptation, the study found that LLMs are systematically overconfident in all languages. For GPT-4o, the most accurate model tested, 15.22% of generations containing strong certainty markers were incorrect, averaged across languages. For Llama-3.1-70B and Llama-3.1-8B, these rates were even higher, at 39.15% and 49.04% respectively. Mandarin showed the highest overconfidence rate for GPT-4o (18%), compared with 11% for English.
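To make the metric concrete, here is a minimal sketch of how an overconfidence rate like the ones above could be computed: the share of marker-bearing generations that turn out to be wrong. This is an illustration, not the authors' code; the marker phrases, field names, and helper functions are assumptions for the example.

```python
# Illustrative sketch (not the study's implementation): compute the
# fraction of generations that carry a strong certainty marker yet
# answer incorrectly. Marker lists and data fields are assumed.

CERTAINTY_MARKERS = {
    "en": ["definitely", "certainly", "without a doubt"],
    "de": ["definitiv", "sicherlich"],
    "ja": ["間違いなく", "確実に"],
}

def has_certainty_marker(text: str, lang: str) -> bool:
    """Return True if the generation contains any strong certainty marker."""
    return any(marker in text for marker in CERTAINTY_MARKERS.get(lang, []))

def overconfidence_rate(generations: list[dict], lang: str) -> float:
    """Share of confidently phrased generations that are factually incorrect.

    Each generation is assumed to be a dict with a "text" string and an
    "is_correct" boolean produced by some upstream grading step.
    """
    confident = [g for g in generations if has_certainty_marker(g["text"], lang)]
    if not confident:
        return 0.0
    incorrect = sum(1 for g in confident if not g["is_correct"])
    return incorrect / len(confident)
```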
Human Reliance and Cross-Linguistic Differences
The study also measured human reliance rates across these languages. It found that users strongly rely on confident LLM generations in all languages. However, reliance behaviors do differ cross-linguistically. For example, users rely significantly more on expressions of uncertainty in Japanese than in English. This is a critical point: even if an LLM uses more hedging language in Japanese, users might still interpret those hedges as more reliable than they would in English.
To measure overreliance risk, the study combined the rate of overconfident generations with the rate of human overreliance on those confident generations. This metric represents the probability that a human will rely on an incorrect response generated by a model using a strong certainty marker. The results showed high overreliance risk across all models and languages. For GPT-4o, the average overreliance risk was nearly 10% across languages, while for Llama-3.1-8B, users were predicted to rely on incorrect responses 40% of the time. Japanese generations, in particular, had the highest risk, nearly 1.6 times that of English generations.
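One natural reading of how the two rates are combined is a simple product: the probability that a confidently phrased generation is wrong, times the probability that a user relies on a confidently phrased generation. The sketch below illustrates that reading; the multiplicative combination and the example numbers are assumptions, not figures or code from the paper.

```python
def overreliance_risk(p_incorrect_given_confident: float,
                      p_rely_given_confident: float) -> float:
    """Probability that a user relies on an incorrect, confidently phrased
    response, assuming the two rates combine multiplicatively."""
    return p_incorrect_given_confident * p_rely_given_confident

# Illustrative numbers only: a model whose confident generations are wrong
# 15% of the time, paired with users who rely on confident generations 65%
# of the time, yields roughly a 10% overreliance risk.
print(overreliance_risk(0.15, 0.65))  # 0.0975
```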
Implications for AI Safety
These findings highlight a significant global safety challenge. Even though multilingual LLMs may adhere to linguistic norms in their expression of uncertainty, they remain systematically overconfident. More importantly, human users tend to overrely on these models in all languages, and this risk can be even greater in languages like Japanese, where uncertainty expressions are common but might have a diminished function as true markers of epistemic state. This means that simply generating more hedges does not necessarily reduce the risk of overreliance if users interpret those hedges differently based on their linguistic and cultural background.
The research stresses the importance of culturally and linguistically contextualized model safety evaluations. Relying solely on an understanding of how English-speaking users interact with English markers would lead to inaccurate estimations of overreliance risks in other languages. This work underscores the need for developers to consider linguistic and social norms when building safe and calibrated language models for a global audience. For more details, see the full research paper.