
Unpacking Nuance: New Benchmarks Evaluate Language Models’ Pragmatic Understanding in Slovene

TLDR: Researchers introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene, comprising 405 multiple-choice questions. The study details challenges in culturally adapting datasets, establishes a human baseline (85% accuracy), and evaluates LLMs. Results show proprietary models like GPT-5 nearing human performance, but a significant gap persists with open-source models, especially in understanding culture-specific non-literal language and certain pragmatic phenomena like ‘Quantity’ and ‘Manner’ flouting.

Large language models (LLMs) are becoming increasingly sophisticated, performing well on many difficult language tasks. However, truly understanding language goes beyond grammar (syntax) and literal meaning (semantics); it also involves pragmatics – grasping situational meaning shaped by context and by linguistic and cultural norms. This deeper level of understanding, often called “nuanced language,” is crucial for effective communication, especially as LLMs become more conversational.

A new research paper, titled “From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene,” by Mojca Brglez and Špela Vintar, addresses this need by introducing the first pragmatics understanding benchmarks specifically for the Slovene language. These benchmarks, named SloPragEval and SloPragMega, consist of a total of 405 multiple-choice questions designed to test how well LLMs understand subtle, context-dependent language in Slovene.

The Challenge of Cultural Adaptation

Creating these benchmarks was not a simple translation task. The researchers highlighted significant difficulties in adapting existing English benchmarks to Slovene. Direct machine translation often resulted in “culturally maladapted datasets” that were unsuitable for non-literal language. This required extensive manual revision and adaptation by expert linguists and even crowdsourcing. Challenges included linguistic differences like idioms, metaphors, and puns, as well as cultural specifics such as names, geographical locations, and culture-bound concepts (e.g., food, holidays). For instance, a “forest resort” in English might become a “cabin” in Slovene, or a pun-based joke had to be completely re-scripted to work in the new cultural context.

Establishing a Human Baseline

To ensure the benchmarks were valid and to provide a reference point for LLM performance, a crowdsourcing campaign was conducted. A total of 79 questionnaires were sent out and 57 complete responses were received, yielding at least six human answers for each of the 300 examples in the SloPragEval dataset. The human baseline showed an average accuracy of around 85%. Interestingly, humans found “Manner-violating” utterances (where the way something is said is unclear or obscure) the most difficult to interpret, with accuracies as low as 67%, while “Literal” utterances were much easier to understand, with over 90% accuracy.
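The per-category accuracies above boil down to a simple aggregation over individual annotator judgements. A minimal sketch (not the authors' actual analysis code; the data here is a hypothetical toy example) might look like this:

```python
from collections import defaultdict

def accuracy_by_category(responses):
    """Aggregate crowd answers into per-category accuracy.

    `responses` is a list of (category, is_correct) pairs — one entry
    per annotator judgement on a benchmark item.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in responses:
        total[category] += 1
        correct[category] += int(is_correct)
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical toy data: three judgements on a Literal item,
# three on a Manner-violating item.
demo = [("Literal", True), ("Literal", True), ("Literal", True),
        ("Manner", True), ("Manner", False), ("Manner", False)]
print(accuracy_by_category(demo))  # → {'Literal': 1.0, 'Manner': 0.333...}
```

With real annotations in place of `demo`, the same aggregation would produce the category breakdown the study reports.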

Evaluating Large Language Models

The study evaluated several instruction-tuned generative models, including four open-source models (DeepSeek-R1-Distil-Qwen 14B, Gemma 3 27B, GaMS 27B, and Llama 3.3 70B) and two closed-source models (OpenAI’s GPT-5 and GPT-5-chat). The results showed a mixed picture:

  • On the smaller SloPragMega benchmark, proprietary models like GPT-5 achieved near-perfect scores on some tasks, such as Humour and Metaphor. Open-source models, especially smaller ones, struggled more, particularly with Humour when prompted in Slovene.
  • For SloPragEval, the state-of-the-art GPT-5 achieved an accuracy of about 81-83%, which is comparable to human performance (85%). However, a significant gap remains between proprietary and open-source models, with the latter scoring much lower (averages as low as 43-51%).
  • Similar to humans, LLMs also found “Manner-flouting” utterances challenging.
  • The biggest performance gap between humans and LLMs was observed in the “Quantity” category (saying less or more than expected). Humans correctly interpreted over 80% of these, while the best LLM managed 76%, and open-source models ranged from 31-67%.
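Evaluations like these typically follow the same loop: present each question with lettered options, parse the model's chosen letter, and compute accuracy. A minimal sketch under that assumption (the `ask_model` callable and the example items are hypothetical stand-ins, not the paper's actual harness):

```python
import re

def score_multiple_choice(items, ask_model):
    """Score a model on multiple-choice items.

    `items` is a list of dicts with "question", "options" (list of
    strings), and "answer" (correct option letter). `ask_model` is any
    callable taking a prompt string and returning the model's raw text
    reply — an assumption standing in for a real API client.
    """
    letters = "ABCD"
    n_correct = 0
    for item in items:
        options = "\n".join(f"{letters[i]}) {opt}"
                            for i, opt in enumerate(item["options"]))
        prompt = (f"{item['question']}\n{options}\n"
                  "Answer with the letter of the best interpretation.")
        reply = ask_model(prompt)
        # Take the first standalone A–D letter as the model's choice.
        match = re.search(r"\b([A-D])\b", reply)
        if match and match.group(1) == item["answer"]:
            n_correct += 1
    return n_correct / len(items)

# Illustration only: a stub "model" that always answers "A".
items = [{"question": "What does the speaker imply?",
          "options": ["Irony", "A literal request"], "answer": "A"},
         {"question": "What does the idiom mean?",
          "options": ["Luck", "Hard work"], "answer": "B"}]
print(score_multiple_choice(items, lambda p: "A"))  # → 0.5
```

Swapping the stub for a real model client (and Slovene prompts for the English ones here) gives the kind of accuracy figures reported above.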

The researchers noted that models sometimes performed similarly or slightly better when prompted in English for SloPragEval, but similarly or better in Slovene for SloPragMega. They also suggested that the high performance of some LLMs might be partly due to overlaps with original English source texts or potential data contamination, where models might have already seen the underlying datasets during training.


Future Directions

The study concludes by emphasizing the importance of creating benchmarks from native data using “bottom-up approaches” to ensure linguistic and cultural authenticity. While LLMs are advancing in nuanced language understanding, especially proprietary models, there’s still room for improvement, particularly for open-source models and in handling culture-specific pragmatic phenomena. Future work will include evaluating more models and exploring open-ended evaluation protocols.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
