TLDR: Researchers introduce Simia-SFT and Simia-RL, two frameworks enabling AI agents to be trained using LLM-simulated environments instead of costly real-world setups. LLMs act as environment simulators, generating realistic feedback and rewards, allowing for scalable data synthesis and reinforcement learning. This approach leads to significant performance gains for open models, often surpassing larger proprietary models on various benchmarks, and simplifies agent training by replacing complex environment engineering with flexible LLM-based simulation.
Large Language Model (LLM) agents are becoming increasingly sophisticated, capable of complex reasoning and problem-solving in specific, well-defined environments. However, these agents often struggle when faced with broader, more complex real-world scenarios that demand adaptability across various tools and situations. Traditionally, creating specialized environments for training these agents is a time-consuming and fragile process, which significantly slows down progress in AI agent development.
A new research paper, authored by Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, and Saravan Rajmohan, introduces a groundbreaking approach to overcome this challenge. The paper, titled “Simulating Environments with Reasoning Models for Agent Training,” demonstrates that LLMs themselves can effectively simulate realistic environment feedback, even without access to actual testbed data or application programming interfaces (APIs). This capability allows for a more flexible and scalable way to train AI agents.
Introducing Simia-SFT and Simia-RL
Inspired by the LLMs’ ability to act as environment simulators, the researchers propose two innovative frameworks: Simia-SFT and Simia-RL. Simia-SFT is a pipeline designed to synthesize Supervised Fine-Tuning (SFT) data. It takes small initial sets of data and expands them into diverse training scenarios, all in a way that is independent of any specific environment. This means developers don’t need to build a new environment for every training task.
Simia-RL, on the other hand, is a framework that enables Reinforcement Learning (RL) training without the need for real-world environment implementations. Instead, it uses feedback generated by LLM simulations. Together, these frameworks offer a path to scalable agent training by replacing the heavy and often brittle traditional environment setups with flexible, LLM-based simulations.
How LLMs Simulate Environments
The core idea is that LLMs possess a “world modeling” ability, allowing them to generate coherent environment dynamics, state transitions, and tool interactions. This means an LLM can act as a virtual world, responding to an agent’s actions just as a real environment would. The simulation process involves providing the LLM with interaction history, tool usage specifications, reference trajectories, and desired environment response formats. The LLM then reasons to produce plausible feedback, including simulated tool outputs and error messages.
For Simia-SFT, the process involves several stages: an LLM-based pre-filtering to ensure the quality of initial data, careful prompt design to guide the generation, the LLM trajectory simulation itself to create diverse multi-round interactions, and finally, rule-based checks to ensure structural correctness of the generated data.
For Simia-RL, the LLM-based simulator provides both environment observations and reward signals. This allows agents to learn and optimize their policies through multi-turn interactions within these simulated worlds, without ever needing to touch a real-world system.
Also Read:
- APOLLO: Enhancing LLM Agent Training for Extended Tasks with Human Guidance
- Unveiling AI’s Research Prowess: A New Benchmark for LLM Agents
Impressive Results
The research shows that fine-tuning open models, such as Qwen3-8B and Qwen2.5-32B-Instruct, on these simulated trajectories leads to significant performance improvements across various benchmarks. For instance, on the τ2-Bench, their 32B model surpassed GPT-4o and xLAM-2-70B, while their 8B model outperformed Qwen2.5-32B-Instruct in specific domains like Airline and Retail tasks. The models also showed strong performance on OfficeBench and AgentBench, demonstrating their ability to handle complex workflows and web navigation tasks.
A particularly interesting finding is that RL training on simulated environments can sometimes yield even better results than training on real environments, especially when the simulated environment provides richer, more adaptive feedback. For example, in an office task, a simulated environment could explain why an event creation failed (e.g., “conflicts with an existing event (Lunch Break)”), allowing the agent to learn and adjust, whereas a real environment might only give a generic “Failed to create event” message.
This work suggests a practical new recipe for training AI agents: replace complex, environment-specific code with flexible LLM simulators. This approach reframes the challenge of environment engineering into a more manageable task of prompt and schema design, paving the way for broader and more scalable progress in the field of agentic LLM training. You can read the full paper for more details at arXiv:2511.01824.


