
Unpacking LLM Performance in SPARQL Query Generation: The Role of Knowledge and Memorization

TL;DR: This research paper investigates how Large Language Models (LLMs) generate SPARQL queries from natural-language questions, focusing on the impact of training-data memorization and knowledge injection. The study introduces a novel evaluation method built on three prompting strategies: zero-shot, knowledge injection, and masked knowledge injection. Experiments on the QALD-9-plus and MCWQ datasets with various LLMs reveal that explicit knowledge injection significantly improves performance. Crucially, the findings indicate a strong memorization effect: LLMs often rely on pre-trained data rather than true reasoning, most visibly when models generate correct Wikidata URIs even though the input was anonymized. This suggests that LLM performance on new or private knowledge graphs may be lower than on familiar benchmarks.

In the evolving landscape of artificial intelligence, natural-language user interfaces are becoming increasingly vital, especially in Question Answering (QA) systems. A core challenge in these systems, particularly those working with Knowledge Graphs (KGs), is converting a natural-language question into a structured query like SPARQL. This process, often called Query Building, is crucial for accessing information stored in KGs.
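To make the Query Building task concrete, here is a minimal illustration; the question and query are our own example, not taken from the paper, though wd:Q64 (Berlin) and wdt:P6 (“head of government”) are real Wikidata identifiers:

```python
# Illustrative only: Query Building maps a natural-language question
# to a SPARQL query over a knowledge graph such as Wikidata.
question = "Who is the mayor of Berlin?"

# wd:Q64 = Berlin, wdt:P6 = "head of government" (real Wikidata identifiers).
sparql = """
SELECT ?answer WHERE {
  wd:Q64 wdt:P6 ?answer .
}
"""
```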

Large Language Models (LLMs) have emerged as a promising solution to enhance the quality of question-answering functionalities. However, a significant concern arises because LLMs are trained on vast amounts of web data, making it difficult for researchers to ascertain whether specific benchmarks or knowledge graphs were already part of their training data. This raises questions about the true capabilities of LLMs versus their tendency to ‘memorize’ information.

A recent research paper, titled “SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection,” introduces a novel method for evaluating how well LLMs generate SPARQL queries from natural-language questions under various conditions. This approach makes it possible, for the first time, to estimate how training data influences the quality of LLM-enhanced QA systems. The method is designed to be portable and robust, supporting any knowledge graph, which makes it widely applicable across KGQA systems and LLMs.

The researchers, Aleksandr Gashkov, Aleksandr Perevalov, Maria Eltsova, and Andreas Both, explored three distinct prompting strategies for LLMs:

Zero-shot SPARQL Generation

In this condition, the LLM receives only the natural-language question and instructions to generate a SPARQL query, without any additional context or information about the knowledge graph’s entities or properties.
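A minimal sketch of such a prompt, assuming a simple instruction template (the exact wording used in the paper is not reproduced here):

```python
def zero_shot_prompt(question: str) -> str:
    """Build a zero-shot prompt: question plus instructions, no KG context."""
    return (
        "Generate a SPARQL query that answers the following question over a "
        "knowledge graph. Return only the query.\n"
        f"Question: {question}"
    )
```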

Knowledge Injection

Here, the LLM is provided with specific and complete knowledge necessary to generate the correct SPARQL query. This includes mappings between entity names and their corresponding URIs (Uniform Resource Identifiers) within the knowledge graph. This simulates a scenario where a Named Entity Recognition component has perfectly identified and linked relevant information.
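As a sketch, knowledge injection can be implemented by appending the gold name-to-URI mappings to the prompt; the function name and prompt wording below are illustrative assumptions, not the paper's template:

```python
def knowledge_injection_prompt(question: str, mappings: dict[str, str]) -> str:
    """Build a prompt that also lists gold entity/property-to-URI mappings."""
    mapping_lines = "\n".join(
        f'"{name}" -> <{uri}>' for name, uri in mappings.items()
    )
    return (
        "Generate a SPARQL query for the question below. "
        "Use these entity and property mappings:\n"
        f"{mapping_lines}\n"
        f"Question: {question}"
    )

prompt = knowledge_injection_prompt(
    "Who is the mayor of Berlin?",
    {
        "Berlin": "http://www.wikidata.org/entity/Q64",
        "head of government": "http://www.wikidata.org/prop/direct/P6",
    },
)
```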

Masked Knowledge Injection

This strategy is similar to knowledge injection but with a crucial difference: the actual URIs from the knowledge graph (e.g., Wikidata URIs) are anonymized or masked. They are replaced with generic, unique identifiers (e.g., ‘kg:1234’). This setup aims to prevent the LLM from relying on any memorized knowledge of specific URIs, thereby testing its true understanding and generation capabilities.
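A sketch of the masking step under these assumptions: the ‘kg:1234’-style placeholder format follows the paper's description, while the counter scheme and reverse mapping are our own additions:

```python
from itertools import count

def mask_mappings(
    mappings: dict[str, str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Replace real KG URIs with opaque kg:<n> placeholders before prompting."""
    ids = count(1000)
    masked, reverse = {}, {}
    for name, uri in mappings.items():
        placeholder = f"kg:{next(ids)}"
        masked[name] = placeholder
        reverse[placeholder] = uri  # kept so generated queries can be un-masked
    return masked, reverse
```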

The study aimed to answer two key research questions: how well LLMs perform with perfect knowledge injection, and what impact memorization has on their SPARQL query generation abilities. To investigate these questions, experiments were conducted using two public benchmarks over the Wikidata knowledge graph: QALD-9-plus and MCWQ. QALD-9-plus is a widely used dataset, while MCWQ is less frequently encountered, allowing the researchers to assess the impact of dataset popularity and potential memorization.
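The article does not spell out the scoring function, but KGQA benchmarks are typically evaluated by executing the generated and gold queries and comparing their answer sets, for instance with F1; a minimal sketch under that assumption:

```python
def answer_f1(predicted: set, gold: set) -> float:
    """F1 over the answer sets returned by the generated vs. the gold query."""
    if not predicted and not gold:
        return 1.0  # both queries return nothing: count as a match
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```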

Eleven different open-source LLMs were tested, ranging in size from 7 billion to 123 billion parameters, including models from Alibaba Cloud (Qwen 2.5), DeepSeek, Mistral AI, and Meta (Llama 3.3). All models were quantized to optimize resource usage and run on high-end GPUs.

The findings revealed several significant insights. Knowledge injection consistently and significantly improved LLM performance, even for smaller models. This highlights that despite their extensive pre-training, LLMs greatly benefit from explicit, structured context. Conversely, zero-shot prompting yielded unsatisfactory results across most models, indicating that LLMs struggle to generate accurate SPARQL queries without some form of external knowledge or guidance.

A crucial discovery pertained to the impact of memorization. Models performed notably better on the well-known QALD-9-plus dataset compared to the less frequently used MCWQ dataset. Furthermore, during the masked knowledge injection experiments, several models still generated queries containing correct Wikidata URIs, even though the prompts provided anonymized URIs and did not refer to Wikidata. This strongly suggests that LLMs often rely on memorized training data rather than pure reasoning, raising concerns about their generalizability to new or private knowledge graphs.
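This leakage is straightforward to detect: under masked injection the prompt contains only kg:&lt;n&gt; placeholders, so any genuine Wikidata identifier in the output must come from the model's training data. A sketch of such a check (the regex is our own assumption, not the paper's code):

```python
import re

# Matches prefixed Wikidata identifiers (wd:Q..., wdt:P...) and full URIs.
WIKIDATA_URI = re.compile(r"wd:Q\d+|wdt:P\d+|wikidata\.org/(?:entity|prop)")

def shows_memorization(generated_query: str) -> bool:
    """True if a query generated from a masked prompt contains Wikidata URIs."""
    return bool(WIKIDATA_URI.search(generated_query))
```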

The researchers also performed an error analysis, categorizing errors into invalid query formats, empty answers, incorrect sets of entities, and the unexpected occurrence of Wikidata URIs during masked injection. This analysis further supported the conclusion that memorized data significantly influences LLM output.
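The four categories can be read as a simple decision procedure over each generated query and its execution result; the sketch below follows the paper's category names, while the function signature and the ordering of the checks are our own assumptions:

```python
def categorize_error(query: str, parses: bool, answers: set,
                     gold: set, masked: bool) -> str:
    """Assign one of the paper's error categories to a generated query."""
    if not parses:
        return "invalid query format"
    if masked and shows_memorization(query):  # check sketched above
        return "Wikidata URIs under masked injection"
    if not answers:
        return "empty answer"
    if answers != gold:
        return "incorrect set of entities"
    return "correct"
```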

In conclusion, this research underscores that while knowledge injection can substantially enhance LLM performance in SPARQL query generation, practitioners must be wary of the memorization effect. The study suggests that good results might often be achieved because a benchmark was already included in the LLM’s training data, which could lead to degraded performance on new or private datasets. Future work will explore commercial LLMs, multilingual capabilities, and advanced strategies to mitigate memorization, aiming for more reliable and generalizable LLM-driven approaches for knowledge graph-based question answering. For more details, you can refer to the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
