Assessing How Large Language Models Understand Structured Information

TLDR: SKA-Bench is a new benchmark designed to rigorously evaluate how Large Language Models (LLMs) understand structured data like knowledge graphs and tables, including when combined with text. It uses a fine-grained approach across four key abilities: handling noise, insensitivity to data order, integrating information, and rejecting questions when no answer exists. Initial evaluations show that even advanced LLMs still struggle with these aspects, highlighting areas for future improvement in structured knowledge comprehension.

Large Language Models, or LLMs, have shown remarkable progress in understanding and generating human-like text. However, their ability to truly grasp structured knowledge, such as information organized in databases or tables, has been less rigorously evaluated. A new research paper introduces SKA-Bench, a comprehensive benchmark designed to provide a more detailed and challenging assessment of how LLMs handle this type of information.

The researchers behind SKA-Bench argue that existing evaluation methods for structured knowledge understanding are often too simplistic or focus on only one type of data. This new benchmark aims to address these limitations by offering a fine-grained approach to diagnose the specific strengths and weaknesses of LLMs.

What is SKA-Bench?

SKA-Bench stands for Structured Knowledge Augmented QA Benchmark. It’s built around a question-answering format and covers four common forms of structured knowledge: Knowledge Graphs (KG), Tables, KG combined with Text, and Tables combined with Text. This diversity allows for a more holistic evaluation of an LLM’s comprehension capabilities.

The creation of SKA-Bench instances follows a three-stage process. First, question-answer pairs are collected. Next, human experts precisely identify the ‘positive knowledge units’: the specific pieces of information needed to answer a question. Finally, ‘noisy knowledge units’ are added; these are irrelevant pieces of information designed to test the LLM’s ability to filter out distractions. This construction, which combines human annotation with LLM assistance for quality control, is intended to keep the benchmark both rigorous and scalable.
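
To make the three stages concrete, here is a minimal sketch of what one instance might look like as a data structure. The class names, field names, and the build_instance helper are assumptions made for illustration, not the benchmark's actual schema, and the example triples are made-up toy data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class KnowledgeUnit:
    """One unit of knowledge, e.g. a KG triple, a table row, or a text passage."""
    content: str
    kind: str  # illustrative labels: "triple", "row", "passage"


@dataclass
class Instance:
    """A QA pair plus the units that support it and the units that distract."""
    question: str
    answer: str
    positive_units: List[KnowledgeUnit] = field(default_factory=list)
    noisy_units: List[KnowledgeUnit] = field(default_factory=list)


def build_instance(question, answer, positive_units, noise_pool):
    """Hypothetical helper mirroring the three stages described above.

    Stage 1 supplies the QA pair, stage 2 supplies the annotated positive
    units, and stage 3 keeps every pooled unit that is not a positive as noise.
    """
    positives = list(positive_units)
    noise = [unit for unit in noise_pool if unit not in positives]
    return Instance(question, answer, positives, noise)


# Toy example (invented data, not taken from the benchmark).
example = build_instance(
    question="Which country is Paris the capital of?",
    answer="France",
    positive_units=[KnowledgeUnit("(Paris, capital_of, France)", "triple")],
    noise_pool=[
        KnowledgeUnit("(Paris, capital_of, France)", "triple"),
        KnowledgeUnit("(Berlin, capital_of, Germany)", "triple"),
        KnowledgeUnit("(Seine, flows_through, Paris)", "triple"),
    ],
)
```

Keeping positive and noisy units in separate lists is a design choice of this sketch: it lets later steps control how much noise is mixed in and where the positive units appear.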

Four Key Abilities Under Evaluation

To provide a detailed diagnosis of LLM shortcomings, SKA-Bench expands its instances into four fundamental ability testbeds:

Noise Robustness: This evaluates how well an LLM can provide accurate answers even when presented with a large amount of irrelevant or ‘noisy’ information alongside the necessary data. As the amount of noise increases, the challenge for the LLM grows.
Order Insensitivity: Structured knowledge, by nature, shouldn’t depend on the order in which its components are presented. This testbed shuffles the knowledge units to see if an LLM’s performance is affected by the arrangement of information, particularly if the relevant data is ‘lost in the middle’ of a long context.
Information Integration: This ability assesses an LLM’s capacity to combine multiple pieces of knowledge to form an answer. This includes integrating several structured knowledge units or combining structured data with unstructured text, which is a common real-world scenario.
Negative Rejection: An important aspect of an intelligent system is knowing when it doesn’t have enough information to answer a question. This testbed provides LLMs with only noisy, irrelevant knowledge units and expects them to respond with a refusal, such as “I don’t know,” rather than hallucinating an answer.
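
Three of these testbeds can be pictured as different ways of assembling a context from the same positive and noisy units. The sketch below is one possible reading of the descriptions above, not the authors' implementation; the function names, the position labels ("prefix", "random", "suffix"), and the seed parameters are all assumptions.

```python
import random
from typing import List, Sequence


def noise_robustness_context(positive: Sequence[str], noisy: Sequence[str],
                             n_noise: int, seed: int = 0) -> List[str]:
    """Mix all positive units with up to n_noise sampled noisy units."""
    rng = random.Random(seed)
    sampled = rng.sample(list(noisy), min(n_noise, len(noisy)))
    context = list(positive) + sampled
    rng.shuffle(context)
    return context


def order_variant(positive: Sequence[str], noisy: Sequence[str],
                  position: str, seed: int = 0) -> List[str]:
    """Place positive units at the start, the end, or random slots of the context."""
    if position == "prefix":
        return list(positive) + list(noisy)
    if position == "suffix":
        return list(noisy) + list(positive)
    if position == "random":
        rng = random.Random(seed)
        context = list(noisy)
        for unit in positive:
            # randint is inclusive on both ends, so a unit may land at the very end.
            context.insert(rng.randint(0, len(context)), unit)
        return context
    raise ValueError(f"unknown position: {position}")


def negative_rejection_context(noisy: Sequence[str], n_noise: int,
                               seed: int = 0) -> List[str]:
    """Build a context with noisy units only, so the expected reply is a refusal."""
    rng = random.Random(seed)
    return rng.sample(list(noisy), min(n_noise, len(noisy)))
```

Raising n_noise in noise_robustness_context corresponds to the growing noise levels described for the first testbed, while comparing the three position settings of order_variant is one way to probe order sensitivity.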

Key Findings from Evaluations

The researchers conducted empirical evaluations on eight representative LLMs, including advanced models like DeepSeek-R1 and GPT-4o, as well as open-source models like Llama3.1-8B and Qwen2.5-7B. The results highlight several significant challenges for current LLMs:

Performance generally degrades as the amount of noisy information increases, indicating that LLMs still struggle with filtering irrelevant data, especially in hybrid (structured + text) scenarios.
Many LLMs exhibit a “Lost in the Middle” phenomenon: their performance suffers when the crucial information is scattered at random positions within a long context rather than placed at the beginning or end.
Integrating multiple knowledge units, particularly from heterogeneous sources (like tables and text), remains a significant hurdle, with smaller LLMs struggling considerably more than larger ones.
Even advanced LLMs show vulnerability to noise interference in the negative rejection test, sometimes attempting to answer questions even when only irrelevant information is provided. Interestingly, some fine-tuned models showed unexpected strength in this area.
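
Findings like these rest on scoring model responses. As a rough illustration only, the sketch below scores answers by substring match and detects refusals by looking for a few fixed phrases. The article does not describe the paper's actual metrics, so the refusal markers, the matching rule, and the function names here are assumptions.

```python
from typing import Sequence

# Assumed refusal phrases; only "I don't know" is mentioned in the article.
REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot be answered")


def normalize(text: str) -> str:
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    return " ".join(text.replace("\u2019", "'").lower().split())


def is_refusal(response: str) -> bool:
    """True if the response contains one of the assumed refusal phrases."""
    cleaned = normalize(response)
    return any(marker in cleaned for marker in REFUSAL_MARKERS)


def rejection_rate(responses: Sequence[str]) -> float:
    """Share of responses that refuse; used when only noisy units were given."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Share of responses that contain the gold answer as a substring."""
    if not responses:
        return 0.0
    hits = sum(normalize(a) in normalize(r) for r, a in zip(responses, answers))
    return hits / len(responses)
```

Substring matching is deliberately lenient and can over-credit long responses; a stricter metric could be swapped in without changing the rest of the sketch.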

In conclusion, SKA-Bench serves as a valuable tool for understanding the current limitations of LLMs in structured knowledge comprehension. The findings suggest that while LLMs have made great strides, there’s still considerable room for improvement in their ability to robustly, consistently, and accurately process complex, structured information. The dataset and code for SKA-Bench are publicly available, encouraging further research in this critical area. The full research paper is titled “SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs.”
