
Unpacking LLM Performance in SPARQL Query Generation: The Role of Knowledge and Memorization

TL;DR: This research paper investigates how Large Language Models (LLMs) generate SPARQL queries from natural-language questions, focusing on the impact of training-data memorization and knowledge injection. The study introduces a novel evaluation method built on three prompting strategies: zero-shot, knowledge injection, and masked knowledge injection. Experiments on the QALD-9-plus and MCWQ datasets with various LLMs reveal that explicit knowledge injection significantly improves performance. Crucially, the findings indicate a strong memorization effect: LLMs often rely on pre-trained data rather than true reasoning, most visibly when models generate correct Wikidata URIs even though the input was anonymized. This suggests that LLM performance on new or private knowledge graphs may be lower than on familiar benchmarks.

In the evolving landscape of artificial intelligence, natural-language user interfaces are becoming increasingly vital, especially in Question Answering (QA) systems. A core challenge in these systems, particularly those working with Knowledge Graphs (KGs), is converting a natural-language question into a structured query like SPARQL. This process, often called Query Building, is crucial for accessing information stored in KGs.
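To make the Query Building task concrete, here is a minimal illustration; the question and query are our own example, not taken from the paper, though wd:Q64 (Berlin) and wdt:P6 (“head of government”) are real Wikidata identifiers:

```python
# Illustrative only: Query Building maps a natural-language question
# to a SPARQL query over a knowledge graph such as Wikidata.
question = "Who is the mayor of Berlin?"

# wd:Q64 = Berlin, wdt:P6 = "head of government" (real Wikidata identifiers).
sparql = """
SELECT ?answer WHERE {
  wd:Q64 wdt:P6 ?answer .
}
"""
```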

Large Language Models (LLMs) have emerged as a promising solution to enhance the quality of question-answering functionalities. However, a significant concern arises because LLMs are trained on vast amounts of web data, making it difficult for researchers to ascertain whether specific benchmarks or knowledge graphs were already part of their training data. This raises questions about the true capabilities of LLMs versus their tendency to ‘memorize’ information.

A recent research paper, titled “SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection,” introduces a novel method for evaluating how well LLMs generate SPARQL queries from natural-language questions under various conditions. This approach makes it possible, for the first time, to estimate how training data influences the quality of LLM-enhanced QA systems. The method is designed to be portable and robust, supporting any knowledge graph, which makes it widely applicable across KGQA systems and LLMs.

The researchers, Aleksandr Gashkov, Aleksandr Perevalov, Maria Eltsova, and Andreas Both, explored three distinct prompting strategies for LLMs:

Zero-shot SPARQL Generation

In this condition, the LLM receives only the natural-language question and instructions to generate a SPARQL query, without any additional context or information about the knowledge graph’s entities or properties.
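A minimal sketch of such a prompt, assuming a simple instruction template (the exact wording used in the paper is not reproduced here):

```python
def zero_shot_prompt(question: str) -> str:
    """Build a zero-shot prompt: question plus instructions, no KG context."""
    return (
        "Generate a SPARQL query that answers the following question over a "
        "knowledge graph. Return only the query.\n"
        f"Question: {question}"
    )
```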

Knowledge Injection

Here, the LLM is provided with specific and complete knowledge necessary to generate the correct SPARQL query. This includes mappings between entity names and their corresponding URIs (Uniform Resource Identifiers) within the knowledge graph. This simulates a scenario where a Named Entity Recognition component has perfectly identified and linked relevant information.
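As a sketch, knowledge injection can be implemented by appending the gold name-to-URI mappings to the prompt; the function name and prompt wording below are illustrative assumptions, not the paper's template:

```python
def knowledge_injection_prompt(question: str, mappings: dict[str, str]) -> str:
    """Build a prompt that also lists gold entity/property-to-URI mappings."""
    mapping_lines = "\n".join(
        f'"{name}" -> <{uri}>' for name, uri in mappings.items()
    )
    return (
        "Generate a SPARQL query for the question below. "
        "Use these entity and property mappings:\n"
        f"{mapping_lines}\n"
        f"Question: {question}"
    )

prompt = knowledge_injection_prompt(
    "Who is the mayor of Berlin?",
    {
        "Berlin": "http://www.wikidata.org/entity/Q64",
        "head of government": "http://www.wikidata.org/prop/direct/P6",
    },
)
```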

Masked Knowledge Injection

This strategy is similar to knowledge injection but with a crucial difference: the actual URIs from the knowledge graph (e.g., Wikidata URIs) are anonymized or masked. They are replaced with generic, unique identifiers (e.g., ‘kg:1234’). This setup aims to prevent the LLM from relying on any memorized knowledge of specific URIs, thereby testing its true understanding and generation capabilities.
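A sketch of the masking step under these assumptions: the ‘kg:1234’-style placeholder format follows the paper's description, while the counter scheme and reverse mapping are our own additions:

```python
from itertools import count

def mask_mappings(
    mappings: dict[str, str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Replace real KG URIs with opaque kg:<n> placeholders before prompting."""
    ids = count(1000)
    masked, reverse = {}, {}
    for name, uri in mappings.items():
        placeholder = f"kg:{next(ids)}"
        masked[name] = placeholder
        reverse[placeholder] = uri  # kept so generated queries can be un-masked
    return masked, reverse
```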

The study aimed to answer two key research questions: how well LLMs perform with perfect knowledge injection, and what impact memorization has on their SPARQL query generation abilities. To investigate these questions, experiments were conducted using two public benchmarks over the Wikidata knowledge graph: QALD-9-plus and MCWQ. QALD-9-plus is a widely used dataset, while MCWQ is less frequently encountered, allowing the researchers to assess the impact of dataset popularity and potential memorization.
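The article does not spell out the scoring function, but KGQA benchmarks are typically evaluated by executing the generated and gold queries and comparing their answer sets, for instance with F1; a minimal sketch under that assumption:

```python
def answer_f1(predicted: set, gold: set) -> float:
    """F1 over the answer sets returned by the generated vs. the gold query."""
    if not predicted and not gold:
        return 1.0  # both queries return nothing: count as a match
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```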

Eleven different open-source LLMs were tested, ranging in size from 7 billion to 123 billion parameters, including models from Alibaba Cloud (Qwen 2.5), DeepSeek, Mistral AI, and Meta (Llama 3.3). All models were quantized to optimize resource usage and run on high-end GPUs.

The findings revealed several significant insights. Knowledge injection consistently and significantly improved LLM performance, even for smaller models. This highlights that despite their extensive pre-training, LLMs greatly benefit from explicit, structured context. Conversely, zero-shot prompting yielded unsatisfactory results across most models, indicating that LLMs struggle to generate accurate SPARQL queries without some form of external knowledge or guidance.

A crucial discovery pertained to the impact of memorization. Models performed notably better on the well-known QALD-9-plus dataset compared to the less frequently used MCWQ dataset. Furthermore, during the masked knowledge injection experiments, several models still generated queries containing correct Wikidata URIs, even though the prompts provided anonymized URIs and did not refer to Wikidata. This strongly suggests that LLMs often rely on memorized training data rather than pure reasoning, raising concerns about their generalizability to new or private knowledge graphs.
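This leakage is straightforward to detect: under masked injection the prompt contains only kg:&lt;n&gt; placeholders, so any genuine Wikidata identifier in the output must come from the model's training data. A sketch of such a check (the regex is our own assumption, not the paper's code):

```python
import re

# Matches prefixed Wikidata identifiers (wd:Q..., wdt:P...) and full URIs.
WIKIDATA_URI = re.compile(r"wd:Q\d+|wdt:P\d+|wikidata\.org/(?:entity|prop)")

def shows_memorization(generated_query: str) -> bool:
    """True if a query generated from a masked prompt contains Wikidata URIs."""
    return bool(WIKIDATA_URI.search(generated_query))
```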

The researchers also performed an error analysis, categorizing errors into invalid query formats, empty answers, incorrect sets of entities, and the unexpected occurrence of Wikidata URIs during masked injection. This analysis further supported the conclusion that memorized data significantly influences LLM output.
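The four categories can be read as a simple decision procedure over each generated query and its execution result; the sketch below follows the paper's category names, while the function signature and the ordering of the checks are our own assumptions:

```python
def categorize_error(query: str, parses: bool, answers: set,
                     gold: set, masked: bool) -> str:
    """Assign one of the paper's error categories to a generated query."""
    if not parses:
        return "invalid query format"
    if masked and shows_memorization(query):  # check sketched above
        return "Wikidata URIs under masked injection"
    if not answers:
        return "empty answer"
    if answers != gold:
        return "incorrect set of entities"
    return "correct"
```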

In conclusion, this research underscores that while knowledge injection can substantially enhance LLM performance in SPARQL query generation, practitioners must be wary of the memorization effect. The study suggests that good results might often be achieved because a benchmark was already included in the LLM’s training data, which could lead to degraded performance on new or private datasets. Future work will explore commercial LLMs, multilingual capabilities, and advanced strategies to mitigate memorization, aiming for more reliable and generalizable LLM-driven approaches for knowledge graph-based question answering. For more details, you can refer to the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
