
CompoST: Unveiling Large Language Models’ Challenges in Compositional Question Interpretation

TLDR: The CompoST benchmark evaluates how well Large Language Models (LLMs) can compositionally interpret questions for knowledge bases, specifically in the context of Question Answering over Linked Data (QALD). The study found that LLMs struggle to systematically understand and translate complex questions into SPARQL queries, with performance significantly dropping as question complexity increases. Even when all necessary information (atomic building blocks) was provided in the input, the models did not achieve high compositional F1 scores, indicating a fundamental weakness in their ability to generalize compositionally.

Large Language Models (LLMs) have shown impressive capabilities in understanding and generating human language. They are increasingly used for tasks like interpreting questions and converting them into structured queries, such as SPARQL queries for knowledge bases. However, a crucial question remains: how systematically do these models interpret language, especially when it comes to complex, multi-part questions?

Language interpretation is inherently compositional, meaning the overall meaning of a sentence is derived from the meanings of its individual parts and how they are combined. For example, if an LLM understands “brown dog” and “black cat,” a truly compositional system should also understand “brown cat.” This concept is often broken down into two sub-properties: productivity (understanding new expressions never encountered before) and systematicity (understanding new combinations of known components).

A new research paper introduces CompoST, a benchmark specifically designed to investigate the systematicity of LLMs in the context of Question Answering over Linked Data (QALD). This benchmark aims to test whether LLMs can interpret structurally complex questions, given that they have already learned the atomic building blocks of those questions. The challenge with evaluating LLMs is that their vast training data makes it difficult to know what they have or haven’t seen. CompoST addresses this by focusing on systematicity in a controlled environment.

The CompoST Dataset: A Controlled Environment for Testing Compositionality

To create CompoST, the researchers generated three datasets of varying difficulty (easy, medium, hard) based on graph patterns found in DBpedia, a large knowledge base. They used a structured approach to verbalize these patterns into natural language questions and their corresponding SPARQL queries. This controlled generation ensures that the relationships between simpler and more complex questions are known, allowing for a precise evaluation of compositional understanding.

The dataset includes “pitchfork-like” or star patterns of different depths and breadths, which are then translated into SPARQL queries and verbalized. For instance, if an LLM understands “Who is the spouse of Michelle Obama?” and “Who is the parent of Malia Obama?”, a compositional system should be able to combine this knowledge to answer “Who is the spouse of Michelle Obama and parent of Malia Obama?” The dataset also includes “self-contained tasks” where all necessary information (smaller, related question-query pairs) is provided in the input, giving the model optimal conditions to demonstrate compositional abilities.
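The Obama example above can be sketched in code. The following is a minimal illustration (not the paper's generation pipeline) of how two atomic graph patterns sharing an answer variable compose into one star-shaped SPARQL query; the DBpedia property IRIs (`dbo:spouse`, `dbo:parent`) and triple directions are assumptions for illustration.

```python
# Composing atomic DBpedia triple patterns into one "star" SPARQL query.
# The property IRIs and triple directions below are illustrative assumptions.

PREFIXES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbr: <http://dbpedia.org/resource/>"
)

def compose(patterns):
    """Join atomic triple patterns that share the answer variable ?x."""
    body = "\n  ".join(patterns)
    return f"{PREFIXES}\nSELECT DISTINCT ?x WHERE {{\n  {body}\n}}"

# Atomic building blocks, each answering one simple question:
atomic = [
    "?x dbo:spouse dbr:Michelle_Obama .",   # Who is the spouse of Michelle Obama?
    "dbr:Malia_Obama dbo:parent ?x .",      # Who is the parent of Malia Obama?
]

# Combined: Who is the spouse of Michelle Obama and parent of Malia Obama?
query = compose(atomic)
print(query)
```

A systematic model should produce the combined query given only the two atomic question–query pairs; CompoST's "self-contained tasks" test exactly this by placing those pairs in the prompt.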

Experiments and Key Findings

The study conducted extensive experiments using various LLMs, including Llama 3.3, Phi-4, Qwen2.5-Coder, OLMo 2, and GPT-4o-mini. They tested these models using zero-shot prompting (no examples), few-shot prompting (with a few examples), and fine-tuning (training the model on the dataset).

The results revealed a consistent pattern: LLMs struggle significantly with systematic compositional interpretation. In classic tasks, where models had to generalize from training data, performance (measured by macro F1 score) degraded sharply as the complexity of the questions increased. For example, the best-performing model’s F1 score dropped from 0.45 for easy questions to just 0.09 for hard questions. This indicates that even when all necessary atomic information was present in the training data, the models failed to combine these parts to understand more complex, unseen structures.
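To make the reported numbers concrete, here is a minimal sketch of the kind of answer-set F1 evaluation common in QALD-style benchmarks: per-question F1 over predicted versus gold answer sets, averaged across questions. The paper's exact macro F1 definition may differ in detail; the toy data below is purely illustrative.

```python
# Answer-set F1, as typically used in QALD-style evaluation (sketch).

def f1(pred: set, gold: set) -> float:
    """F1 between a predicted and a gold answer set."""
    if not pred and not gold:
        return 1.0          # both empty: perfect match
    tp = len(pred & gold)   # answers found in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(results) -> float:
    """Average F1 over a list of (predicted_set, gold_set) pairs."""
    return sum(f1(p, g) for p, g in results) / len(results)

# Toy illustration of the observed trend: scores collapse on hard questions.
easy = [({"a"}, {"a"}), ({"b"}, {"b", "c"})]
hard = [(set(), {"a"}), ({"x"}, {"y"})]
print(round(macro_f1(easy), 2))  # 0.83
print(round(macro_f1(hard), 2))  # 0.0
```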

Even in the “self-contained tasks,” where all relevant information was explicitly provided in the input, the compositionality F1 scores did not exceed 0.57. This suggests a fundamental limitation in LLMs’ ability to systematically compose answers, even when the building blocks are readily available. While fine-tuning generally improved overall performance compared to in-context learning, it did not fully resolve the compositional generalization issue, especially for more complex questions.


Conclusion: A Call for Deeper Compositional Understanding

The findings from the CompoST benchmark strongly suggest that current LLMs do not strictly satisfy the property of systematicity. Their performance degrades significantly with increasing structural complexity, even when all necessary information is conceptually available. This indicates that LLMs often rely on statistical heuristics or pattern matching rather than true compositional understanding. While fine-tuning can offer some improvements, it doesn’t fundamentally overcome this challenge.

This research highlights a critical area for future work: developing more specialized training techniques that can truly enhance LLMs’ ability to reason and generalize compositionally. For more details, you can refer to the full research paper: CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
