TLDR: SKA-Bench is a new benchmark designed to rigorously evaluate how Large Language Models (LLMs) understand structured data like knowledge graphs and tables, including when combined with text. It uses a fine-grained approach across four key abilities: handling noise, insensitivity to data order, integrating information, and rejecting questions when no answer exists. Initial evaluations show that even advanced LLMs still struggle with these aspects, highlighting areas for future improvement in structured knowledge comprehension.
Large Language Models, or LLMs, have shown remarkable progress in understanding and generating human-like text. However, their ability to truly grasp structured knowledge, such as information organized in databases or tables, has been less rigorously evaluated. A new research paper introduces SKA-Bench, a comprehensive benchmark designed to provide a more detailed and challenging assessment of how LLMs handle this type of information.
The researchers behind SKA-Bench argue that existing evaluation methods for structured knowledge understanding are often too simplistic or focus on only one type of data. This new benchmark aims to address these limitations by offering a fine-grained approach to diagnose the specific strengths and weaknesses of LLMs.
What is SKA-Bench?
SKA-Bench stands for Structured Knowledge Augmented QA Benchmark. It is built around a question-answering format and covers four common forms of structured knowledge: Knowledge Graphs (KGs), Tables, KGs combined with Text, and Tables combined with Text. This diversity allows for a more holistic evaluation of an LLM’s comprehension capabilities.
SKA-Bench instances are built in three stages. First, question-answer pairs are collected. Second, human experts identify the ‘positive knowledge units’, the specific pieces of information needed to answer each question. Third, ‘noisy knowledge units’ are added: irrelevant pieces of information designed to test the LLM’s ability to filter out distractions. Combining human annotation with LLM-assisted quality control is intended to keep the benchmark both rigorous and scalable.
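To make the three stages concrete, here is a minimal Python sketch of what one instance might look like. The class name `SKAInstance`, its field names, and the helper `build_instance` are hypothetical and chosen for illustration only; the actual SKA-Bench data format may differ, and the human annotation step is represented here simply by passing in the positive units.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class SKAInstance:
    """Hypothetical container for one instance (field names are assumptions)."""
    question: str
    answer: str
    positive_units: List[str]  # units annotated as needed to answer the question
    noisy_units: List[str] = field(default_factory=list)  # irrelevant distractors


def build_instance(question, answer, positive_units, candidate_pool, n_noise, seed=0):
    """Combine a QA pair, its annotated positive units, and sampled noise."""
    rng = random.Random(seed)
    # Noise candidates exclude anything annotated as positive.
    distractors = [u for u in candidate_pool if u not in positive_units]
    noisy = rng.sample(distractors, min(n_noise, len(distractors)))
    return SKAInstance(question, answer, list(positive_units), noisy)
```

A knowledge unit is modeled as a plain string here (for example, a serialized KG triple or table row); that is a simplification for the sketch.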
Four Key Abilities Under Evaluation
To provide a detailed diagnosis of LLM shortcomings, SKA-Bench expands its instances into four fundamental ability testbeds:
- Noise Robustness: This evaluates how well an LLM can provide accurate answers even when presented with a large amount of irrelevant or ‘noisy’ information alongside the necessary data. As the amount of noise increases, the challenge for the LLM grows.
- Order Insensitivity: Structured knowledge, by nature, shouldn’t depend on the order in which its components are presented. This testbed shuffles the knowledge units to see if an LLM’s performance is affected by the arrangement of information, particularly if the relevant data is ‘lost in the middle’ of a long context.
- Information Integration: This ability assesses an LLM’s capacity to combine multiple pieces of knowledge to form an answer. This includes integrating several structured knowledge units or combining structured data with unstructured text, which is a common real-world scenario.
- Negative Rejection: An important aspect of an intelligent system is knowing when it doesn’t have enough information to answer a question. This testbed provides LLMs with only noisy, irrelevant knowledge units and expects them to respond with a refusal, such as “I don’t know,” rather than hallucinating an answer.
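The four testbeds can be pictured as different ways of assembling a context from the same positive and noisy units. The Python sketch below illustrates that idea under stated assumptions: all function names are hypothetical, knowledge units are plain strings, and the "more than one positive unit" criterion for integration is a simplification rather than the benchmark's actual construction code.

```python
import random


def noise_robustness_context(positive, noisy, n_noise, seed=0):
    """Positive units plus the first n_noise distractors, in shuffled order."""
    rng = random.Random(seed)
    context = list(positive) + list(noisy[:n_noise])
    rng.shuffle(context)
    return context


def order_variants(positive, noisy, seed=0):
    """Same set of units, with the positive units placed in different positions."""
    rng = random.Random(seed)
    shuffled = list(positive) + list(noisy)
    rng.shuffle(shuffled)
    return {
        "positive_first": list(positive) + list(noisy),
        "positive_last": list(noisy) + list(positive),
        "random": shuffled,
    }


def requires_integration(positive):
    """Assumed criterion: the answer needs more than one knowledge unit."""
    return len(positive) > 1


def negative_rejection_context(noisy):
    """Only distractors: the expected model behavior is a refusal."""
    return list(noisy)
```

Varying `n_noise` gives contexts of increasing difficulty for the noise test, while comparing a model's answers across the three `order_variants` contexts indicates how sensitive it is to ordering.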
Key Findings from Evaluations
The researchers conducted empirical evaluations on 8 representative LLMs, including advanced models like DeepSeek-R1 and GPT-4o, as well as open-source models like Llama3.1-8B and Qwen2.5-7B. The results highlight several significant challenges for current LLMs:
- Performance generally degrades as the amount of noisy information increases, indicating that LLMs still struggle with filtering irrelevant data, especially in hybrid (structured + text) scenarios.
- Many LLMs exhibit a “Lost in the Middle” phenomenon, meaning their performance suffers when the crucial information is placed randomly within a long context rather than at the beginning or end.
- Integrating multiple knowledge units, particularly from heterogeneous sources (like tables and text), remains a significant hurdle, with smaller LLMs struggling considerably more than larger ones.
- Even advanced LLMs show vulnerability to noise interference in the negative rejection test, sometimes attempting to answer questions even when only irrelevant information is provided. Interestingly, some fine-tuned models showed unexpected strength in this area.
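As a rough illustration of how a negative rejection result could be scored, the sketch below computes the fraction of responses that contain a refusal phrase. The marker list and both function names are assumptions made for this example; the article does not describe the paper's actual scoring procedure, and simple string matching will miss refusals phrased in other ways.

```python
# Hypothetical refusal phrases; a real evaluation would need a broader check.
REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot be answered")


def is_refusal(response):
    """Return True if the response contains one of the assumed refusal phrases."""
    # Normalize case and curly apostrophes before matching.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def rejection_rate(responses):
    """Fraction of responses counted as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

For example, `rejection_rate(["I don't know.", "Paris"])` counts one refusal out of two responses.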
In conclusion, SKA-Bench serves as a valuable tool for understanding the current limitations of LLMs in structured knowledge comprehension. The findings suggest that while LLMs have made great strides, there is still considerable room for improvement in their ability to process complex, structured information robustly, consistently, and accurately. The dataset and code for SKA-Bench are publicly available, encouraging further research in this area. The full research paper is titled “SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs.”


