
Unpacking Nuance: New Benchmarks Evaluate Language Models’ Pragmatic Understanding in Slovene

TLDR: Researchers introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene, comprising 405 multiple-choice questions. The study details challenges in culturally adapting datasets, establishes a human baseline (85% accuracy), and evaluates LLMs. Results show proprietary models like GPT-5 nearing human performance, but a significant gap persists with open-source models, especially in understanding culture-specific non-literal language and certain pragmatic phenomena like ‘Quantity’ and ‘Manner’ flouting.

Large language models (LLMs) are becoming increasingly sophisticated, performing well on many difficult language tasks. However, truly understanding language goes beyond grammar (syntax) and literal meaning (semantics); it also involves pragmatics – grasping situational meaning shaped by context and by linguistic and cultural norms. This deeper level of understanding, often called “nuanced language,” is crucial for effective communication, especially as LLMs become more conversational.

A new research paper, titled “From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene,” by Mojca Brglez and Špela Vintar, addresses this need by introducing the first pragmatics understanding benchmarks specifically for the Slovene language. These benchmarks, named SloPragEval and SloPragMega, consist of a total of 405 multiple-choice questions designed to test how well LLMs understand subtle, context-dependent language in Slovene.

The Challenge of Cultural Adaptation

Creating these benchmarks was not a simple translation task. The researchers highlighted significant difficulties in adapting existing English benchmarks to Slovene. Direct machine translation often resulted in “culturally maladapted datasets” that were unsuitable for non-literal language. This required extensive manual revision and adaptation by expert linguists and even crowdsourcing. Challenges included linguistic differences like idioms, metaphors, and puns, as well as cultural specifics such as names, geographical locations, and culture-bound concepts (e.g., food, holidays). For instance, a “forest resort” in English might become a “cabin” in Slovene, or a pun-based joke had to be completely re-scripted to work in the new cultural context.

Establishing a Human Baseline

To ensure the benchmarks were valid and to provide a reference point for LLM performance, a crowdsourcing campaign was conducted. A total of 79 questionnaires were sent out and 57 complete responses were received, yielding at least six human answers for each of the 300 examples in the SloPragEval dataset. The human baseline showed an average accuracy of around 85%. Interestingly, humans found “Manner-violating” utterances (where the way something is said is unclear or obscure) the most difficult to interpret, with accuracies as low as 67%, while “Literal” utterances were much easier to understand, with over 90% accuracy.
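The per-category accuracies above boil down to a simple aggregation over individual annotator judgements. A minimal sketch (not the authors' actual analysis code; the data here is a hypothetical toy example) might look like this:

```python
from collections import defaultdict

def accuracy_by_category(responses):
    """Aggregate crowd answers into per-category accuracy.

    `responses` is a list of (category, is_correct) pairs — one entry
    per annotator judgement on a benchmark item.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in responses:
        total[category] += 1
        correct[category] += int(is_correct)
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical toy data: three judgements on a Literal item,
# three on a Manner-violating item.
demo = [("Literal", True), ("Literal", True), ("Literal", True),
        ("Manner", True), ("Manner", False), ("Manner", False)]
print(accuracy_by_category(demo))  # → {'Literal': 1.0, 'Manner': 0.333...}
```

With real annotations in place of `demo`, the same aggregation would produce the category breakdown the study reports.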

Evaluating Large Language Models

The study evaluated several instruction-tuned generative models, including four open-source models (DeepSeek-R1-Distil-Qwen 14B, Gemma 3 27B, GaMS 27B, and Llama 3.3 70B) and two closed-source models (OpenAI’s GPT-5 and GPT-5-chat). The results showed a mixed picture:

  • On the smaller SloPragMega benchmark, proprietary models like GPT-5 achieved near-perfect scores on some tasks, such as Humour and Metaphor. Open-source models, especially smaller ones, struggled more, particularly with Humour when prompted in Slovene.
  • For SloPragEval, the state-of-the-art GPT-5 achieved an accuracy of about 81-83%, which is comparable to human performance (85%). However, a significant gap remains between proprietary and open-source models, with the latter scoring much lower (averages as low as 43-51%).
  • Similar to humans, LLMs also found “Manner-flouting” utterances challenging.
  • The biggest performance gap between humans and LLMs was observed in the “Quantity” category (saying less or more than expected). Humans correctly interpreted over 80% of these, while the best LLM managed 76%, and open-source models ranged from 31-67%.
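Evaluations like these typically follow the same loop: present each question with lettered options, parse the model's chosen letter, and compute accuracy. A minimal sketch under that assumption (the `ask_model` callable and the example items are hypothetical stand-ins, not the paper's actual harness):

```python
import re

def score_multiple_choice(items, ask_model):
    """Score a model on multiple-choice items.

    `items` is a list of dicts with "question", "options" (list of
    strings), and "answer" (correct option letter). `ask_model` is any
    callable taking a prompt string and returning the model's raw text
    reply — an assumption standing in for a real API client.
    """
    letters = "ABCD"
    n_correct = 0
    for item in items:
        options = "\n".join(f"{letters[i]}) {opt}"
                            for i, opt in enumerate(item["options"]))
        prompt = (f"{item['question']}\n{options}\n"
                  "Answer with the letter of the best interpretation.")
        reply = ask_model(prompt)
        # Take the first standalone A–D letter as the model's choice.
        match = re.search(r"\b([A-D])\b", reply)
        if match and match.group(1) == item["answer"]:
            n_correct += 1
    return n_correct / len(items)

# Illustration only: a stub "model" that always answers "A".
items = [{"question": "What does the speaker imply?",
          "options": ["Irony", "A literal request"], "answer": "A"},
         {"question": "What does the idiom mean?",
          "options": ["Luck", "Hard work"], "answer": "B"}]
print(score_multiple_choice(items, lambda p: "A"))  # → 0.5
```

Swapping the stub for a real model client (and Slovene prompts for the English ones here) gives the kind of accuracy figures reported above.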

The researchers noted that models sometimes performed similarly or slightly better when prompted in English for SloPragEval, but similarly or better in Slovene for SloPragMega. They also suggested that the high performance of some LLMs might be partly due to overlaps with original English source texts or potential data contamination, where models might have already seen the underlying datasets during training.


Future Directions

The study concludes by emphasizing the importance of creating benchmarks from native data using “bottom-up approaches” to ensure linguistic and cultural authenticity. While LLMs are advancing in nuanced language understanding, especially proprietary models, there’s still room for improvement, particularly for open-source models and in handling culture-specific pragmatic phenomena. Future work will include evaluating more models and exploring open-ended evaluation protocols.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
