TLDR: GVGAI-LLM is a new benchmark that evaluates Large Language Models (LLMs) as agents in arcade-style video games. It uses textual game state representations and natural language rules to test LLMs’ reasoning and problem-solving abilities in a zero-shot setting. The research found that current LLMs consistently exhibit limitations in spatial reasoning, planning, and understanding symbolic transformations, often performing poorly compared to traditional AI methods and being significantly slower.
Large Language Models, or LLMs, have shown incredible progress in understanding and generating human-like text. They are increasingly being used as intelligent agents in various interactive settings, from controlling robots to navigating websites. However, a significant challenge remains: how do we truly evaluate their ability to make decisions and solve problems in dynamic, rule-based environments, especially those requiring spatial reasoning and planning?
Existing benchmarks for LLMs often focus on language understanding, code generation, or following instructions. While valuable, these don’t fully capture the complexities of real-time decision-making in structured, symbolic worlds like video games. To address this gap, researchers have introduced GVGAI-LLM, a new benchmark designed specifically to test the reasoning and problem-solving skills of LLM agents in a diverse collection of arcade-style games.
GVGAI-LLM is built upon the General Video Game AI (GVGAI) framework, which is known for its wide variety of game dynamics and a formal language for describing video games. This allows for the rapid creation of new games and levels, helping to prevent models from simply memorizing solutions over time. A key innovation is how game scenes are presented to the LLM: they are represented by a compact set of ASCII characters, making them efficient for language models to process.
The benchmark evaluates LLMs in a ‘zero-shot’ setting, meaning the models make decisions based solely on the current game state, without any memory of past actions or game history. This forces the LLM to reason about the immediate environment and rules. The game rules themselves are translated into natural language descriptions, along with a mapping of game entities (like ‘a’ for avatar or ‘%’ for diamond) to their meanings. The LLM then receives this structured textual prompt and chooses an action, such as ‘move right’ or ‘action nil’ (do nothing).
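To make the prompt structure concrete, here is a minimal sketch of how such a zero-shot prompt might be assembled. It is an illustration only, not the benchmark's actual code: the function name `build_prompt`, the prompt wording, the example rule text, and the symbols 'w' (wall) and '.' (empty floor) are assumptions. Only 'a' for avatar and '%' for diamond come from the description above.

```python
def build_prompt(grid_rows, legend, rules, actions):
    """Assemble a zero-shot prompt from the current state only (no history)."""
    legend_lines = [f"'{symbol}' = {meaning}" for symbol, meaning in legend.items()]
    sections = [
        "Game rules:",
        rules.strip(),
        "",
        "Symbol legend:",
        *legend_lines,
        "",
        "Current state:",
        *grid_rows,
        "",
        "Available actions: " + ", ".join(actions),
        "Reply with exactly one action.",
    ]
    return "\n".join(sections)


# Hypothetical one-row level: the avatar must walk right to reach the diamond.
grid = [
    "wwwww",
    "wa.%w",
    "wwwww",
]
legend = {"w": "wall", "a": "avatar", "%": "diamond", ".": "empty floor"}
actions = ["move left", "move right", "move up", "move down", "action nil"]

prompt = build_prompt(grid, legend, "Collect every diamond to win.", actions)
print(prompt)
```

Because the prompt is rebuilt from scratch at every step and carries no history, anything the model needs to know about the past has to be visible in the grid itself.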
To measure performance, GVGAI-LLM uses several interpretable metrics. The ‘meaningful step ratio’ quantifies how often an agent’s actions actually change the game environment, distinguishing purposeful moves from ineffective ones. ‘Step efficiency’ measures how well an agent achieves objectives with minimal effort. Standard ‘win rate’ and ‘normalized reward’ are also tracked to assess task completion and progress.
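As a rough illustration of the first metric, the sketch below counts a step as meaningful when the observed state differs before and after the action. This is one plausible reading of the description above. The paper's exact formula may differ (for example, in how it treats changes caused by other entities rather than by the agent), and the function names are assumptions.

```python
def meaningful_step_ratio(states):
    """Fraction of steps after which the observed state changed.

    `states` is the sequence of observations s_0, s_1, ..., s_T, so there
    are T steps; step t counts as meaningful when s_t != s_{t+1}.
    """
    steps = len(states) - 1
    if steps <= 0:
        return 0.0
    changed = sum(1 for before, after in zip(states, states[1:]) if before != after)
    return changed / steps


def win_rate(outcomes):
    """Fraction of episodes that ended in a win (True) rather than a loss."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical episode: the second action had no effect on the state.
episode = ["wa.%w", "w.a%w", "w.a%w", "w..aw"]
ratio = meaningful_step_ratio(episode)   # 2 of 3 steps changed the state
wins = win_rate([True, False, True, True])
print(f"meaningful step ratio: {ratio:.2f}, win rate: {wins:.2f}")
```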
Experiments with various LLMs, including different versions of GPT and Gemini, revealed consistent limitations. Despite receiving clear textual prompts, LLMs frequently made ‘spatial grounding errors,’ misinterpreting the layout of the game world or misjudging distances. They also showed ‘symbolic identity confusion,’ sometimes failing to track an entity after it transformed (e.g., an avatar that picks up a key becomes an ‘avatar with key’). Another common issue was ‘behavioral misalignment,’ where models chose to do nothing (‘action nil’) even when clear interactions were possible, indicating a lack of goal-directed behavior.
While some prompt modifications, like explicit coordinate tagging and verbose spatial grounding, offered partial improvements, they didn’t fully resolve these core reasoning challenges. The research highlights that current LLMs lack the sophisticated path planning capabilities of traditional AI methods like A* search, which can simulate future states. Furthermore, LLM agents were found to be significantly slower than symbolic search methods, taking seconds per move compared to milliseconds.
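For readers unfamiliar with A* search, the following is a minimal sketch of grid path planning on an ASCII level. It is a generic textbook version, not the baseline used in the research: it assumes a static map, four-directional movement, and the hypothetical symbols 'w' for wall and '.' for floor, and it does not simulate moving entities the way a planner with a forward model would.

```python
import heapq


def astar(grid, start_symbol="a", goal_symbol="%", wall_symbol="w"):
    """Return the shortest list of (row, col) cells from start to goal, or None."""
    cells = {(r, c): ch for r, row in enumerate(grid) for c, ch in enumerate(row)}
    start = next(pos for pos, ch in cells.items() if ch == start_symbol)
    goal = next(pos for pos, ch in cells.items() if ch == goal_symbol)

    def heuristic(pos):
        # Manhattan distance never overestimates with 4-directional moves.
        return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]  # entries are (f, g, position)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, current = heapq.heappop(frontier)
        if current == goal:
            path = []
            while current is not None:  # walk parent links back to the start
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if g > cost[current]:
            continue  # stale queue entry; a cheaper route was already found
        r, c = current
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            # Cells outside the grid are treated like walls.
            if cells.get(nxt, wall_symbol) == wall_symbol:
                continue
            new_cost = g + 1
            if new_cost < cost.get(nxt, float("inf")):
                cost[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal is unreachable


level = [
    "wwwwww",
    "wa.w%w",
    "w..w.w",
    "w....w",
    "wwwwww",
]
path = astar(level)
print(path)
```

On this small level the avatar has to detour around the wall column, which is the kind of multi-step lookahead the LLM agents reportedly struggled with.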
The GVGAI-LLM benchmark provides a reproducible testbed for studying language model capabilities, particularly agentic behavior and contextual reasoning in structured environments. It shows that while LLMs excel in many areas, fundamental weaknesses in spatial understanding, symbolic reasoning, and planning persist when they face the dynamic logic of video games. The benchmark is intended to support research toward more capable AI agents. You can find more details in the full research paper available at arXiv.org.


