
CompoST: Unveiling Large Language Models’ Challenges in Compositional Question Interpretation

TLDR: The CompoST benchmark evaluates how well Large Language Models (LLMs) can compositionally interpret questions for knowledge bases, specifically in the context of Question Answering over Linked Data (QALD). The study found that LLMs struggle to systematically understand and translate complex questions into SPARQL queries, with performance significantly dropping as question complexity increases. Even when all necessary information (atomic building blocks) was provided in the input, the models did not achieve high compositional F1 scores, indicating a fundamental weakness in their ability to generalize compositionally.

Large Language Models (LLMs) have shown impressive capabilities in understanding and generating human language. They are increasingly used for tasks like interpreting questions and converting them into structured queries, such as SPARQL queries for knowledge bases. However, a crucial question remains: how systematically do these models interpret language, especially when it comes to complex, multi-part questions?

Language interpretation is inherently compositional, meaning the overall meaning of a sentence is derived from the meanings of its individual parts and how they are combined. For example, if an LLM understands “brown dog” and “black cat,” a truly compositional system should also understand “brown cat.” This concept is often broken down into two sub-properties: productivity (understanding new expressions never encountered before) and systematicity (understanding new combinations of known components).

A new research paper introduces CompoST, a benchmark specifically designed to investigate the systematicity of LLMs in the context of Question Answering over Linked Data (QALD). This benchmark aims to test whether LLMs can interpret structurally complex questions, given that they have already learned the atomic building blocks of those questions. The challenge with evaluating LLMs is that their vast training data makes it difficult to know what they have or haven’t seen. CompoST addresses this by focusing on systematicity in a controlled environment.

The CompoST Dataset: A Controlled Environment for Testing Compositionality

To create CompoST, the researchers generated three datasets of varying difficulty (easy, medium, hard) based on graph patterns found in DBpedia, a large knowledge base. They used a structured approach to verbalize these patterns into natural language questions and their corresponding SPARQL queries. This controlled generation ensures that the relationships between simpler and more complex questions are known, allowing for a precise evaluation of compositional understanding.

The dataset includes “pitchfork-like” or star patterns of different depths and breadths, which are then translated into SPARQL queries and verbalized. For instance, if an LLM understands “Who is the spouse of Michelle Obama?” and “Who is the parent of Malia Obama?”, a compositional system should be able to combine this knowledge to answer “Who is the spouse of Michelle Obama and parent of Malia Obama?” The dataset also includes “self-contained tasks” where all necessary information (smaller, related question-query pairs) is provided in the input, giving the model optimal conditions to demonstrate compositional abilities.
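The Obama example above can be sketched in code. The following is a minimal illustration (not the paper's generation pipeline) of how two atomic graph patterns sharing an answer variable compose into one star-shaped SPARQL query; the DBpedia property IRIs (`dbo:spouse`, `dbo:parent`) and triple directions are assumptions for illustration.

```python
# Composing atomic DBpedia triple patterns into one "star" SPARQL query.
# The property IRIs and triple directions below are illustrative assumptions.

PREFIXES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbr: <http://dbpedia.org/resource/>"
)

def compose(patterns):
    """Join atomic triple patterns that share the answer variable ?x."""
    body = "\n  ".join(patterns)
    return f"{PREFIXES}\nSELECT DISTINCT ?x WHERE {{\n  {body}\n}}"

# Atomic building blocks, each answering one simple question:
atomic = [
    "?x dbo:spouse dbr:Michelle_Obama .",   # Who is the spouse of Michelle Obama?
    "dbr:Malia_Obama dbo:parent ?x .",      # Who is the parent of Malia Obama?
]

# Combined: Who is the spouse of Michelle Obama and parent of Malia Obama?
query = compose(atomic)
print(query)
```

A systematic model should produce the combined query given only the two atomic question–query pairs; CompoST's "self-contained tasks" test exactly this by placing those pairs in the prompt.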

Experiments and Key Findings

The study conducted extensive experiments using various LLMs, including Llama 3.3, Phi-4, Qwen2.5-Coder, OLMo 2, and GPT-4o-mini. They tested these models using zero-shot prompting (no examples), few-shot prompting (with a few examples), and fine-tuning (training the model on the dataset).

The results revealed a consistent pattern: LLMs struggle significantly with systematic compositional interpretation. In classic tasks, where models had to generalize from training data, performance (measured by macro F1 score) degraded sharply as the complexity of the questions increased. For example, the best-performing model’s F1 score dropped from 0.45 for easy questions to just 0.09 for hard questions. This indicates that even when all necessary atomic information was present in the training data, the models failed to combine these parts to understand more complex, unseen structures.
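To make the reported numbers concrete, here is a minimal sketch of the kind of answer-set F1 evaluation common in QALD-style benchmarks: per-question F1 over predicted versus gold answer sets, averaged across questions. The paper's exact macro F1 definition may differ in detail; the toy data below is purely illustrative.

```python
# Answer-set F1, as typically used in QALD-style evaluation (sketch).

def f1(pred: set, gold: set) -> float:
    """F1 between a predicted and a gold answer set."""
    if not pred and not gold:
        return 1.0          # both empty: perfect match
    tp = len(pred & gold)   # answers found in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(results) -> float:
    """Average F1 over a list of (predicted_set, gold_set) pairs."""
    return sum(f1(p, g) for p, g in results) / len(results)

# Toy illustration of the observed trend: scores collapse on hard questions.
easy = [({"a"}, {"a"}), ({"b"}, {"b", "c"})]
hard = [(set(), {"a"}), ({"x"}, {"y"})]
print(round(macro_f1(easy), 2))  # 0.83
print(round(macro_f1(hard), 2))  # 0.0
```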

Even in the “self-contained tasks,” where all relevant information was explicitly provided in the input, the compositionality F1 scores did not exceed 0.57. This suggests a fundamental limitation in LLMs’ ability to systematically compose answers, even when the building blocks are readily available. While fine-tuning generally improved overall performance compared to in-context learning, it did not fully resolve the compositional generalization issue, especially for more complex questions.


Conclusion: A Call for Deeper Compositional Understanding

The findings from the CompoST benchmark strongly suggest that current LLMs do not strictly satisfy the property of systematicity. Their performance degrades significantly with increasing structural complexity, even when all necessary information is conceptually available. This indicates that LLMs often rely on statistical heuristics or pattern matching rather than true compositional understanding. While fine-tuning can offer some improvements, it doesn’t fundamentally overcome this challenge.

This research highlights a critical area for future work: developing more specialized training techniques that can truly enhance LLMs’ ability to reason and generalize compositionally. For more details, you can refer to the full research paper: CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
