TLDR: A new research paper introduces GeoAnalystBench, a benchmark of 50 Python-based geospatial tasks designed to evaluate Large Language Models (LLMs) for spatial analysis workflow and code generation. The study reveals a significant performance gap, with proprietary models like ChatGPT-4o-mini, Claude 3.5 Sonnet, and Gemini 1.5 Flash outperforming open-source counterparts in workflow validity, structural alignment, semantic similarity, and code quality. Tasks requiring deep spatial reasoning remain challenging for all models, but domain knowledge and dataset descriptions significantly improve LLM accuracy. The benchmark highlights both the promise and limitations of current LLMs in GIS automation and suggests future directions for GeoAI research.
Large Language Models (LLMs) are rapidly changing how we interact with technology, and their potential in specialized fields like geospatial analysis is a hot topic. But truly understanding how well they handle complex Geographic Information Systems (GIS) workflows and code generation requires rigorous testing. A new research paper introduces GeoAnalystBench, a comprehensive benchmark designed to do just that: systematically evaluate how well LLMs perform in spatial analysis.
Unveiling GeoAnalystBench: A New Standard for GeoAI Evaluation
The paper, titled “GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation” by Qianheng Zhang, Song Gao, Chen Wei, Yibo Zhao, Ying Nie, Ziru Chen, Shijie Chen, Yu Su, and Huan Sun, highlights the need for a standardized evaluation framework. GeoAnalystBench consists of 50 Python-based tasks, all derived from real-world geospatial problems and carefully validated by GIS experts. Each task comes with a minimum expected output, and LLMs are evaluated on several fronts: the validity of their workflow, how well their structure aligns with expert logic, the semantic similarity of their generated steps, and the quality of their code using a metric called CodeBLEU.
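To make the setup concrete, here is a minimal sketch of what one benchmark entry and its four evaluation dimensions might look like in Python. The field names, task text, and datasets are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical GeoAnalystBench-style task record; fields are illustrative
# assumptions, not the paper's actual schema.
task = {
    "task_id": 17,
    "question": "Identify census tracts within 1 km of a proposed transit line.",
    "datasets": {
        "tracts.shp": "polygon layer of census tracts with population counts",
        "transit_line.shp": "proposed transit route as a polyline",
    },
    "expected_output": "shapefile of tracts intersecting the 1 km buffer",
}

# The four evaluation dimensions described above, as a checklist to fill in.
evaluation = {
    "workflow_validity": None,     # does the generated workflow run end to end?
    "structural_alignment": None,  # workflow length vs. expert baseline (MAD)
    "semantic_similarity": None,   # similarity of generated step descriptions
    "code_quality": None,          # CodeBLEU against expert reference code
}
```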
The Performance Divide: Proprietary vs. Open-Source LLMs
The initial findings from GeoAnalystBench reveal a clear performance gap between proprietary models (like ChatGPT-4o-mini, Claude 3.5 Sonnet, and Gemini 1.5 Flash) and smaller open-source models (such as DeepSeek-R1-7B and CodeLlama-7B). Proprietary models consistently achieved high validity rates (over 93%) and stronger code alignment. For instance, Gemini 1.5 Flash showed the highest validity rate at 96%, while ChatGPT-4o-mini demonstrated the lowest mean absolute deviation (MAD) in workflow length, indicating the closest match to expert logic.
In contrast, many open-source models struggled. DeepSeek-R1-7B, for example, had a validity rate as low as 48.5% and high MAD values, suggesting it often generated incomplete or inconsistent workflows; the authors attribute this in part to its knowledge-distillation training, which can reduce reasoning depth. Llama-3.1-8B, however, stood out as a promising open-source option with a 95.3% validity rate, showing that with targeted refinement, open-source LLMs can be viable for geospatial automation.
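Both comparisons lean on the MAD statistic, so here is a quick illustration of how mean absolute deviation in workflow length can be computed; the step counts below are invented for illustration, not taken from the paper:

```python
# Mean absolute deviation (MAD) in workflow length: how far generated
# workflows drift from expert step counts. Numbers are invented examples.
expert_steps    = [5, 7, 4, 6, 8]   # expert workflow lengths per task
generated_steps = [5, 9, 3, 6, 11]  # lengths of LLM-generated workflows

mad = sum(abs(g - e) for g, e in zip(generated_steps, expert_steps)) / len(expert_steps)
print(f"MAD in workflow length: {mad:.2f}")  # lower = closer to expert logic
```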
The Power of Context: Domain Knowledge and Dataset Descriptions
A crucial insight from the study is the significant role of contextual information. Providing LLMs with domain knowledge (DK) and detailed dataset descriptions (DD) greatly enhanced their accuracy across all proprietary models. This suggests that structured information about GIS concepts and data specifics helps LLMs reduce ‘hallucinations’ and generate more relevant and accurate workflows and code.
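One plausible way to operationalize this is simply to prepend the contextual blocks to the task prompt. The sketch below shows that pattern; the DK/DD strings and template are placeholders, not the paper's actual prompts:

```python
# Hedged sketch: assembling a prompt with domain knowledge (DK) and a
# dataset description (DD). All strings are placeholders for illustration.
domain_knowledge = (
    "A buffer creates a zone at a fixed distance around features. "
    "Use a projected CRS (e.g., UTM) so distances are in meters."
)
dataset_description = (
    "tracts.shp: EPSG:26915, fields = [GEOID, POP2020, geometry]. "
    "transit_line.shp: EPSG:26915, a single LineString feature."
)
task = "Identify census tracts within 1 km of the proposed transit line."

prompt = (
    f"Domain knowledge:\n{domain_knowledge}\n\n"
    f"Dataset description:\n{dataset_description}\n\n"
    f"Task:\n{task}\n\n"
    "Return a step-by-step workflow, then Python code."
)
```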
Code Quality: Syntax vs. Logic
When it comes to code generation, proprietary models again outperformed their open-source counterparts. ChatGPT-4o-mini led with the best average CodeBLEU score of 0.390. While LLMs generally produced syntactically correct code, they often struggled with lexical similarity (how closely their code resembles human-written code) and logical correctness (how well the code manipulates data in a logically similar way to expert code). This highlights a challenge: LLMs can write code that looks correct but might not perfectly replicate the logical steps or parameter choices of a human expert.
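Readers who want to reproduce this kind of scoring can use the open-source `codebleu` package on PyPI (an assumption on my part; the paper may use a different implementation or weighting), which exposes the lexical, syntactic, and dataflow components separately:

```python
# Sketch using the open-source `codebleu` package (pip install codebleu);
# the paper's exact implementation and weights may differ.
from codebleu import calc_codebleu

reference  = "gdf['area_km2'] = gdf.to_crs(epsg=26915).geometry.area / 1e6"
prediction = "gdf['area_km2'] = gdf.geometry.area / 1e6"  # forgot to reproject

result = calc_codebleu([reference], [prediction], lang="python",
                       weights=(0.25, 0.25, 0.25, 0.25))
# Syntax can score well while dataflow (logic) lags -- the gap described above.
print(result["codebleu"], result["syntax_match_score"], result["dataflow_match_score"])
```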
Challenging Tasks for LLMs
The benchmark also categorized spatial analysis tasks to see where LLMs struggled most. Tasks requiring deeper spatial reasoning, such as “Finding the best locations and paths” and “Determining how places are related,” proved to be the most challenging across all models. These tasks often involve complex optimization, network analysis, and multi-step decision-making. Conversely, tasks like “Measuring size, shape, and distribution” or “Detecting and quantifying patterns” were handled relatively better, as they rely more on statistical pattern recognition and geometric reasoning.
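To see why the geometric categories are more tractable, note that a "measuring size, shape, and distribution" task often reduces to a few library calls, as in this minimal GeoPandas sketch (the file name and CRS are assumptions):

```python
# Minimal GeoPandas sketch of a "measuring size, shape, and distribution"
# task; "tracts.shp" and EPSG:26915 are assumptions for illustration.
import math
import geopandas as gpd

gdf = gpd.read_file("tracts.shp").to_crs(epsg=26915)  # project so units are meters
gdf["area_km2"] = gdf.geometry.area / 1e6
# Polsby-Popper compactness: 4*pi*area / perimeter^2 (1.0 = a perfect circle)
gdf["compactness"] = 4 * math.pi * gdf.geometry.area / gdf.geometry.length ** 2

print(gdf[["area_km2", "compactness"]].describe())  # distribution summary
```

By contrast, a "best locations and paths" task would also demand network construction, cost modeling, and optimization, with far more decision points for a model to get wrong.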
ArcPy vs. Open-Source Libraries
The study also touched on the performance difference between ArcPy (Esri's Python library for GIS) and open-source geospatial packages. ArcPy-based tasks showed better results, likely due to the library's stability, well-organized documentation, and standardized functions within the ArcGIS ecosystem. Open-source libraries, while offering flexibility, can introduce inconsistencies due to their diverse syntax and update cycles.
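The contrast shows up even in a simple buffer operation; the sketch below pairs the ArcPy tool call with a GeoPandas equivalent (file paths and distances are placeholders, and the ArcPy lines are commented out because they require an ArcGIS license):

```python
# Same buffer operation in both ecosystems; paths and distances are placeholders.

# ArcPy: one standardized tool signature with explicit units.
# import arcpy
# arcpy.analysis.Buffer("roads.shp", "roads_buffer.shp", "500 Meters")

# GeoPandas: flexible, but distance units depend on the layer's CRS.
import geopandas as gpd

roads = gpd.read_file("roads.shp").to_crs(epsg=26915)  # meters after projecting
buffered = roads.copy()
buffered["geometry"] = roads.geometry.buffer(500)
buffered.to_file("roads_buffer.shp")
```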
Real-World Applications and Future Directions
Two case studies illustrate the benchmark’s practical application: identifying animal home ranges and analyzing car accident hotspots. While LLMs successfully outlined necessary steps, they sometimes fell short in parameter optimization, leading to lower-resolution outputs or different interpretations compared to human experts. This underscores the need for enhanced tuning mechanisms or built-in guidance for LLM-generated workflows.
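The bandwidth in a kernel density hotspot analysis is a concrete instance of that parameter-optimization gap: an LLM may pick a plausible default where an analyst would tune it. This sketch, using synthetic accident coordinates, shows how the choice reshapes the density surface:

```python
# Hedged sketch: kernel density over synthetic accident points, showing how
# the bandwidth parameter changes the hotspot surface. Not the paper's code.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
accidents = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))  # fake x/y coords

for bandwidth in (0.1, 0.5, 1.0):  # the knob an LLM often leaves untuned
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(accidents)
    density_at_center = np.exp(kde.score_samples([[0.0, 0.0]]))[0]
    print(f"bandwidth={bandwidth}: density at center = {density_at_center:.3f}")
```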
In conclusion, GeoAnalystBench provides a valuable framework for understanding the current strengths and limitations of LLMs in GeoAI. While advanced LLMs show great promise for automating GIS workflows, there’s still a need for specialized training datasets, better spatial reasoning capabilities, and potentially human-AI collaborative strategies to bridge the gap in complex, context-dependent spatial problems. You can find the full research paper here: GeoAnalystBench Research Paper.