TLDR: Researchers introduce a novel tool for formally verifying the safety of Large Language Model (LLM) policies in memoryless sequential decision-making tasks. By encoding environment states as natural language prompts, parsing LLM actions, and incrementally building a formal model checked by Storm, the method provides rigorous safety guarantees. Experiments show that open-source LLMs can be verified but currently underperform deep reinforcement learning baselines, highlighting the feasibility of the approach and the need for improved LLM performance and efficiency for trustworthy AI deployment.
Large Language Models (LLMs) are increasingly being used for complex tasks, including sequential decision-making. Imagine an LLM controlling a self-driving taxi: it needs to make a series of choices—move, refuel, pick up passengers—based on its current situation, like its position, fuel level, and passenger status. While LLMs show great promise in these scenarios, their decisions often lack formal guarantees, meaning they can sometimes lead to unsafe or undesirable outcomes, such as running out of fuel or causing collisions.
A new research paper introduces a tool designed to rigorously and automatically verify the safety of LLM-based policies in these sequential decision-making tasks. The core idea is to apply "model checking," a formal-verification technique that mathematically proves whether a system satisfies a given set of requirements.
Understanding the Approach
The tool focuses on "memoryless" sequential decision-making, meaning the LLM's decision at any given moment depends only on the current state of the environment, not on the history of past events. This is exactly the Markov property underlying a Markov Decision Process (MDP): the probability of future states is determined solely by the present state and the chosen action.
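As a toy illustration (the states, actions, and probabilities below are made up for this article, not taken from the paper), an MDP can be written as a nested mapping in which each state-action pair yields a distribution over next states, with no reference to history:

```python
# A toy MDP fragment for a taxi-like domain (hypothetical values).
# transitions[state][action] is a list of (probability, next_state) pairs;
# the next-state distribution depends only on the current state and action.
transitions = {
    "at_depot_fuel_5": {
        "north":  [(0.9, "at_road_fuel_4"), (0.1, "at_depot_fuel_4")],
        "refuel": [(1.0, "at_depot_fuel_10")],
    },
    "at_road_fuel_4": {
        "north":  [(0.9, "at_city_fuel_3"), (0.1, "at_road_fuel_3")],
    },
}
```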
The verification process involves several key steps:
- Defining the Task: The sequential decision-making task is first defined as a Markov Decision Process (MDP), which outlines the environment’s dynamics.
- LLM Policy: An LLM acts as the “policy,” meaning it proposes actions based on the current state.
- Safety Requirement: A specific safety rule, expressed as a Probabilistic Computation Tree Logic (PCTL) formula, is provided. For example, “the probability of the taxi running out of fuel must be less than 0.1.”
- State Encoding: Each state of the environment (e.g., taxi’s position, fuel level) is translated into a natural-language prompt that the LLM can understand.
- Action Parsing: The LLM’s natural-language response is then parsed back into a concrete action that the system can execute (a minimal sketch of the encoder and parser follows this list).
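To make the encoding and parsing steps concrete, here is a minimal sketch for the taxi domain. The prompt wording, the action vocabulary, and the function names are illustrative assumptions; the paper's actual prompts and parsing rules may differ.

```python
import re

# Assumed action vocabulary for the taxi domain (illustrative only).
ACTIONS = {"north", "south", "east", "west", "pickup", "dropoff", "refuel"}

def encode_state(pos, fuel, has_passenger):
    """Turn an environment state into a natural-language prompt."""
    return (
        f"You are a taxi at grid cell {pos} with {fuel} units of fuel. "
        f"A passenger is {'on board' if has_passenger else 'waiting'}. "
        f"Choose exactly one action from: {', '.join(sorted(ACTIONS))}."
    )

def parse_action(llm_response):
    """Map the LLM's free-form reply back to a concrete action, or None."""
    for word in re.findall(r"[a-z]+", llm_response.lower()):
        if word in ACTIONS:
            return word
    return None
```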
The tool then incrementally builds only the relevant parts of the MDP: the states and transitions that are reachable under the LLM's chosen actions. This formal model is fed into Storm, a probabilistic model checker, which determines whether the LLM policy, in combination with the state encoder and action parser, satisfies the specified safety property within the environment. To ensure consistent results, the LLM's outputs are made deterministic by fixing random seeds; since a deterministic memoryless policy resolves all of the MDP's choices, the induced model is a Markov chain that Storm can check exactly.
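A rough sketch of this incremental construction and the Storm check, assuming a deterministic `llm_policy(state)` callable, the `transitions` mapping from the toy fragment above, and Storm's Python bindings (stormpy); the file path and the "fuel_empty" label are hypothetical:

```python
from collections import deque

import stormpy  # Python bindings for the Storm model checker

def build_induced_chain(initial_state, llm_policy, transitions):
    """Explore only the states reachable under the LLM's chosen actions.

    With a fixed seed the policy is deterministic, so the result is a
    Markov chain: chain[state] is the (probability, successor) list for
    the action the LLM chose in that state.
    """
    chain, frontier = {}, deque([initial_state])
    while frontier:
        state = frontier.popleft()
        if state in chain:
            continue                    # already expanded
        action = llm_policy(state)      # one LLM query per new state
        # Unparseable actions fall back to a self-loop (a modeling choice).
        chain[state] = transitions.get(state, {}).get(action, [(1.0, state)])
        for _, successor in chain[state]:
            frontier.append(successor)
    return chain

# After serializing the induced chain to PRISM format (omitted here),
# Storm checks the PCTL safety property from the example above:
program = stormpy.parse_prism_program("induced_taxi.prism")
props = stormpy.parse_properties('P<0.1 [ F "fuel_empty" ]', program)
model = stormpy.build_model(program, props)
result = stormpy.model_checking(model, props[0])
print(result.at(model.initial_states[0]))  # True iff the bound holds
```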
Experimental Insights
The researchers tested their tool on standard benchmarks, including grid-world environments like “Frozen Lake” (navigating an icy grid without falling into holes), a “Taxi” domain (avoiding running out of fuel), and a “Stock Market” simulation (avoiding bankruptcy). They used open-source LLMs like Gemma, Llama, DeepSeek-R1, and Mistral, accessed via Ollama, an open-source platform for running LLMs locally.
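For reference, a deterministic query through Ollama's Python client could look like the sketch below; the model tag is illustrative, it reuses the hypothetical `encode_state`/`parse_action` helpers from the earlier sketch, and the fixed seed with zero temperature mirrors the determinism requirement described above.

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

prompt = encode_state(pos=(2, 3), fuel=5, has_passenger=False)
response = ollama.chat(
    model="llama3.1",                       # any locally pulled model tag
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0, "seed": 0},  # fixed seed: reproducible output
)
action = parse_action(response["message"]["content"])
print(action)
```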
The experiments showed that while open-source LLMs can indeed be formally verified once their outputs are made deterministic, they generally underperform traditional deep reinforcement learning (RL) baselines at satisfying safety properties. For instance, an LLM policy may give the taxi a higher probability of running out of fuel than an RL agent would. The verification process itself was also computationally intensive, dominated by the time LLMs take to process prompts and generate responses.
Future Implications
This work represents a significant step towards ensuring the trustworthiness and safety of LLM-based systems in critical decision-making roles. By providing a practical framework for formal verification, it enables continuous benchmarking of new LLM releases and user-defined tasks. This lays a crucial foundation for formally certifying future, more capable LLMs, ensuring they can be deployed reliably in environments where safety is paramount.
The paper highlights the need for further advancements in LLM policy performance, state encoding, action parsing, and inference efficiency. Despite current limitations, this framework bridges a critical gap in formally verifying LLM-based policies in sequential decision-making tasks, paving the way for more robust and explainable AI systems. You can read the full research paper here.