
Nazonazo: Japanese Riddles Uncover LLM’s Insight Problem

TLDR: The Nazonazo benchmark, using Japanese children’s riddles, evaluates Large Language Models’ (LLMs) insight-based reasoning. It reveals that most LLMs, except GPT-5, significantly underperform humans. The study highlights LLMs’ struggle with representational shifts and metacognitive control, often generating correct answers but failing to endorse them, suggesting a critical area for future AI improvement.

In the rapidly evolving world of Artificial Intelligence, evaluating the true capabilities of Large Language Models (LLMs) has become a significant challenge. Many existing benchmarks are reaching a point of ‘saturation,’ where state-of-the-art models score so highly that it’s hard to distinguish real progress. This situation, dubbed an ‘evaluation crisis,’ calls for new, more robust testing methods.

A recent research paper, “The Nazonazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs”, introduces a novel approach: using Japanese children’s riddles, known as ‘Nazonazo,’ as a benchmark. Authored by Masaharu Mizumoto, Dat Nguyen, Zhiheng Han, Jiyuan Fang, Heyuan Guan, Xingfu Li, Naoya Shiraishi, Xuyang Tian, Yo Nakawake, and Le Minh Nguyen, this study proposes a cost-effective and scalable solution to assess insight-based reasoning in LLMs.

What is Nazonazo?

Nazonazo are traditional Japanese wordplay riddles, often short and requiring no specialized domain knowledge. They typically involve a ‘representational shift’ or ‘cognitive restructuring’ to solve, meaning the answer isn’t found through simple logical steps but by looking at the problem in a completely new way. For example, a riddle might play on homophones or decompose Kanji characters to reveal the solution. This makes them ideal for testing ‘insight problem solving,’ a cognitive ability where a solution suddenly appears after an initial impasse.

Why Nazonazo Challenges LLMs

The researchers highlight three key capacities Nazonazo probes that are particularly challenging for AI:

  • Representational Shift: Moving from a fixed initial understanding of the problem to a flexible re-interpretation.
  • Metacognitive Control: Managing multiple hypotheses, assessing confidence, and dynamically selecting candidates.
  • Non-linear Search: Accommodating unexpected discoveries and intuitive changes in confidence, rather than just systematic, step-by-step searching.

While current AI excels at systematic search, these non-linear and metacognitively regulated processes are where they often fall short.

The Study: Humans vs. LLMs

The study evaluated 38 frontier LLMs and 126 adults on an initial set of 120 Nazonazo riddles, later expanded to 201. The results were striking: humans achieved a mean accuracy of 52.9%, while most LLMs, with the notable exception of GPT-5, fell well short of human performance. Many models scored less than half of the human average.
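To make the accuracy comparison concrete, here is a minimal sketch of how a Nazonazo-style score could be computed: exact match against a gold answer after light text normalization. The riddle answers and the scoring scheme shown here are illustrative assumptions, not items or code from the actual benchmark.

```python
import unicodedata

def normalize(answer: str) -> str:
    # NFKC normalization folds full-width/half-width variants, which
    # matters for Japanese text; also strip whitespace and lowercase.
    return unicodedata.normalize("NFKC", answer).strip().lower()

def accuracy(predictions: list[str], gold: list[str]) -> float:
    # Mean exact-match accuracy over a set of riddles.
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["パン", " りんご "]
golds = ["パン", "りんご"]
print(accuracy(preds, golds))  # 1.0
```

Normalization matters here because Japanese answers can legitimately vary between full-width and half-width forms without being wrong.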

Interestingly, the study found that ‘reasoning models’ (LLMs designed with explicit reasoning capabilities) significantly outperformed ‘non-reasoning models.’ However, even these advanced reasoning models only achieved around 27% accuracy, underscoring the inherent difficulty of the task for AI. Furthermore, the size of the model (parameter count) showed no reliable association with accuracy, suggesting that simply making models larger isn’t the solution for insight-based reasoning.

The ‘Verification Failure’ and Metacognitive Gaps

A deep dive into the LLMs’ ‘thought logs’ (the internal reasoning steps they generate) revealed a critical weakness: ‘verification failure.’ Models often produced the correct solution among intermediate candidates but failed to select it as the final answer. This suggests a lack of robust ‘metacognitive feelings’—the AI equivalent of a human’s ‘feeling of rightness’ (FoR) or ‘feeling of error’ (FoE).

The logs showed models expressing phrases like “I’m stuck,” “feels closest,” or “feels like a dead end,” mirroring human experiences of impasse and trial-and-error. They even showed ‘Aha!’ moments. However, these ‘feelings’ in LLMs were often unreliable, leading to ‘false Aha!’ moments where the model confidently committed to a wrong answer, or, conversely, cases where it passed over a correct answer it had already generated. This indicates that while LLMs can simulate metacognitive processes, their internal ‘feelings’ for correctness are not yet as reliable as those in humans.
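A verification failure of the kind described above is easy to express programmatically: the gold answer appears among a model’s intermediate candidates, yet its final answer differs. The sketch below illustrates this check; the log structure and field names are assumptions for illustration, not the paper’s actual analysis code.

```python
def verification_failure(log: dict, gold: str) -> bool:
    # A verification failure occurs when the model generated the
    # correct answer as a candidate but did not endorse it at the end.
    candidates = [c.strip() for c in log["candidates"]]
    final = log["final_answer"].strip()
    return gold in candidates and final != gold

# Example: the model considered "shadow" but settled on "clock".
log = {
    "candidates": ["clock", "calendar", "shadow"],
    "final_answer": "clock",
}
print(verification_failure(log, gold="shadow"))  # True
```

Counting such cases across a corpus of thought logs is one way to quantify the metacognitive gap the authors describe, separately from raw accuracy.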


Future Directions for AI

The Nazonazo benchmark offers a clear path for future AI development. By focusing on tasks that require genuine restructuring rather than just recall or search, it measures a model’s core reasoning ability. The insights from the thought-log analysis provide concrete guidance: strengthening metacognitive feelings for correctness and improving the control of search processes are key areas for improving AI’s problem-solving capabilities. This research paves the way for an ‘AI metacognitive psychology,’ systematically studying how AI systems can better deploy human-like control signals to enhance their reasoning.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach out to her at: [email protected]
