TL;DR: A new study introduces GeoSQL-Eval, the first automated evaluation framework for assessing how well Large Language Models (LLMs) generate PostGIS queries from natural language, together with GeoSQL-Bench, its accompanying benchmark. It evaluates LLMs across four cognitive dimensions, five proficiency levels, and twenty task categories, using a dataset of 14,178 questions. The framework systematically tests knowledge acquisition, syntactic generation, semantic alignment, execution accuracy, and robustness, revealing key strengths and weaknesses of current LLMs in handling complex spatial database tasks.
Large Language Models (LLMs) have made incredible strides in understanding natural language and generating structured queries for databases, a field often called NL2SQL. However, applying these powerful AI models to geographic information systems, specifically for generating queries in the PostGIS environment (known as GeoSQL), presents unique challenges. This is due to the intricate nature of spatial functions, specialized geometric data types, and complex execution rules inherent in spatial databases.
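To make the challenge concrete, here is a minimal, illustrative GeoSQL query; the parks and rivers tables and their geom columns are assumptions made for this example, not part of GeoSQL-Bench.

```sql
-- Illustrative only: the parks and rivers tables (with a geom column) are assumptions.
-- Natural-language question: "Which parks lie within 500 metres of a river?"
SELECT DISTINCT p.name
FROM parks AS p
JOIN rivers AS r
  ON ST_DWithin(
       p.geom::geography,   -- cast to geography so the distance is measured in metres
       r.geom::geography,
       500
     );
```

Even this simple case requires the model to pick the right spatial predicate, handle geometry versus geography types, and keep distance units consistent.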
Current evaluation methods often fall short, focusing either on general relational databases or on code generation for platforms like Google Earth Engine, leaving a significant gap for systematic benchmarks tailored to spatial databases. To bridge this gap, a new study introduces GeoSQL-Eval, the first comprehensive, end-to-end automated evaluation framework specifically designed for PostGIS query generation.
Introducing GeoSQL-Eval and GeoSQL-Bench
GeoSQL-Eval is built upon Webb’s Depth of Knowledge (DOK) model, offering a robust framework that covers four cognitive dimensions, five proficiency levels, and twenty distinct task categories. This allows for a thorough assessment of an LLM’s performance across various aspects: how well it acquires knowledge, its ability to generate correct syntax, its semantic understanding, the accuracy of its query execution, and its overall robustness.
Alongside this framework, the researchers developed GeoSQL-Bench, a large-scale benchmark dataset of 14,178 questions spanning three main task types, 340 PostGIS functions, and 82 domain-specific databases, providing a comprehensive and realistic testing ground for LLMs.
How LLMs Were Evaluated
The study systematically evaluated 24 representative LLMs across six categories, including general non-reasoning models, reasoning-enhanced models, code generation models, geospatial code generation models, general SQL generation models, and specialized GeoSQL generation strategies. Using advanced statistical analyses, the researchers uncovered significant differences in model performance, identified common error patterns, and analyzed resource consumption.
Key findings highlight that while LLMs show promise, they face specific hurdles in GeoSQL tasks. For instance, models often struggle with ‘Function Hallucination’ (inventing non-existent functions) and ‘Parameter Misuse’ (supplying parameters in the wrong order or with the wrong types). Other common issues include ‘Invalid Geometry Construction’ and ‘SRID and Projection Errors’, which are critical for accurate spatial analysis. These errors are often subtle and difficult to catch with standard syntax checks, making robust evaluation crucial.
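As a rough illustration of these error categories, the sketch below reconstructs each one; the queries are written for this summary (not taken from the paper), and the parcels table is hypothetical.

```sql
-- Illustrative reconstructions of the four error categories; the parcels table
-- and its geom column are hypothetical, and these queries are not from the paper.

-- Function hallucination: ST_BufferMeters does not exist in PostGIS.
-- SELECT ST_BufferMeters(geom, 100) FROM parcels;

-- Parameter misuse: arguments in the wrong order (ST_Transform expects the geometry first).
-- SELECT ST_Transform(4326, geom) FROM parcels;

-- Invalid geometry construction: the ring is not closed, so ST_MakePolygon raises an error.
-- SELECT ST_MakePolygon(ST_GeomFromText('LINESTRING(0 0, 1 0, 1 1)'));

-- SRID and projection handling: buffering a lon/lat geometry by 100 yields a 100-degree
-- buffer; casting to geography makes the distance metres.
SELECT id, ST_Buffer(geom::geography, 100) AS buffered_100m
FROM parcels;
```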
Evaluation Layers and Metrics
The GeoSQL-Eval framework assesses models across several layers:
- Conceptual Understanding: Evaluates basic knowledge of spatial queries, function purposes, parameter types, and behavioral specifications.
- Structured SQL Generation: Checks the syntactic correctness and executability of generated queries within a PostGIS environment.
- Semantic Alignment & Invocation: Focuses on how accurately generated queries align with database schemas, including table and field recognition, and correct function invocation.
- Execution & Result Accuracy: Validates the effectiveness of queries by comparing their execution results (numerical, textual, Boolean, or geometric) against reference answers (see the sketch after this list).
- Robust Generalization & Reasoning: Assesses model stability across multiple attempts and its resilience to variations or ambiguities in natural language inputs.
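A minimal sketch of the kind of geometric result check the execution layer implies is shown below; the generated_result and reference_answers tables are hypothetical, and the framework's actual comparison logic may differ.

```sql
-- Hypothetical check: does the model query's geometry match the reference answer?
-- generated_result and reference_answers are invented table names for this sketch.
SELECT ST_Equals(
         (SELECT ST_Union(geom) FROM generated_result),              -- model query output
         (SELECT geom FROM reference_answers WHERE question_id = 1)  -- ground-truth answer
       ) AS result_matches;
```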
Core Conclusions and Insights
The evaluation revealed several important insights:
- Task Complexity: While models generally perform well on conceptual understanding, syntax generation and table schema retrieval tasks pose greater challenges, especially with complex logic and multi-table joins (an illustrative query of this kind follows this list).
- Model Performance: Leading general non-reasoning and reasoning-enhanced models (like GPT-5, o4-mini, and DeepSeek-R1-0528) demonstrated superior performance, balancing accuracy and stability. Larger SQL generation models also showed a significant advantage over smaller ones.
- Common Errors: The majority of errors stemmed from PostGIS function misuse and general SQL syntax errors, indicating a need for better understanding of spatial function specifics and complex query structures.
- Efficiency: General non-reasoning models were typically more efficient, while some reasoning-enhanced models, such as o4-mini and GPT-5, struck a good balance between accuracy and computational cost.
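For context on the multi-table joins mentioned above, here is an illustrative query of the kind such evaluations flag as difficult; the districts, buildings, and flood_zones tables are invented for this example and are not drawn from GeoSQL-Bench.

```sql
-- Illustrative only: districts, buildings, and flood_zones are assumed tables.
-- A multi-table spatial join with aggregation, the kind of query the study
-- found most challenging for current models.
SELECT d.name,
       COUNT(DISTINCT b.id) AS buildings_at_risk
FROM districts   AS d
JOIN buildings   AS b ON ST_Contains(d.geom, b.geom)
JOIN flood_zones AS f ON ST_Intersects(b.geom, f.geom)
GROUP BY d.name
ORDER BY buildings_at_risk DESC;
```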
This pioneering work not only pushes the boundaries of NL2SQL applications but also provides a standardized, interpretable, and scalable framework for evaluating LLM performance in spatial database contexts. The insights gained are invaluable for optimizing LLMs and advancing applications in geographic information science, urban studies, and spatial analysis. The researchers have also established a public GeoSQL-Eval leaderboard to foster ongoing testing and collaboration within the global research community. Further details are available in the original research paper.


