TLDR: RLFactory is a plug-and-play reinforcement learning framework designed to improve how Large Language Models (LLMs) interact with external tools over multiple turns. It tackles tool-call stability, adaptability, and diverse reward computation through an asynchronous tool-call mechanism, a decoupled architecture, and a flexible reward framework. By reconstructing the MDP state space with observation markers and implementing a “generate-parse-invoke-update” workflow, RLFactory enables efficient, stable multi-turn interactions. In experiments it outperforms larger models while achieving a 6.8x increase in training throughput, making LLM-tool collaboration more robust and efficient.
Large Language Models (LLMs) have shown impressive capabilities in understanding, generating, and reasoning with natural language. However, they often struggle with tasks that require real-time information, complex calculations, or multi-step interactions with external environments. This is where the “model-tool” collaboration paradigm comes into play, allowing LLMs to use external tools like search engines, code interpreters, or databases to enhance their performance.
Multi-turn tool usage, where models dynamically adjust their strategies based on tool feedback over several interactions, is a typical form of these complex tasks. Think of planning a trip, which might involve multiple calls to different tools to gather information and refine the itinerary. While the potential is huge, applying reinforcement learning (RL) to these multi-turn tool-use scenarios presents significant challenges: diverse tool interfaces make it hard to keep tool calls stable and adaptable, and different task requirements demand different reward computations.
To address these hurdles, a new framework called RLFactory has been introduced. It’s a plug-and-play reinforcement learning post-training framework specifically designed to boost LLMs’ multi-round tool-use capabilities, and it aims to lower the barrier for researching and building LLM tool-interaction systems.
The Core Innovations of RLFactory
RLFactory stands out with several key design principles:
- Asynchronous Tool Call Mechanism: It uses an asynchronous approach built on asyncio, which dramatically improves efficiency: the model can issue requests to multiple tools concurrently rather than waiting for one tool to respond before calling the next (a minimal sketch follows this list).
- Decoupled Architecture: The framework separates the tool invocation module from the training module. This modular design significantly reduces the effort and cost associated with setting up the tool environment, making it easier to integrate new tools.
- Diverse Reward Computation Framework: Recognizing that different tasks require different evaluation methods, RLFactory supports several reward calculation strategies: rule-based rewards for tasks with clear success criteria, model-based judgment for more open-ended tasks, and tool verification for scenarios where external operations are needed to validate results (a second sketch below illustrates such a dispatch).
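To make the asynchronous mechanism concrete, here is a minimal sketch of concurrent tool invocation with asyncio. The `call_tool` helper and the tool names are illustrative assumptions, not RLFactory’s actual API:

```python
import asyncio

async def call_tool(name: str, query: str) -> str:
    """Hypothetical network-bound tool call (search, code execution, ...)."""
    await asyncio.sleep(1.0)  # stand-in for I/O latency
    return f"{name} result for {query!r}"

async def invoke_all(calls: list[tuple[str, str]]) -> list[str]:
    """Fire all tool calls concurrently instead of one after another."""
    return await asyncio.gather(*(call_tool(n, q) for n, q in calls))

# Three calls finish in roughly 1 second total rather than 3 sequentially.
results = asyncio.run(invoke_all([
    ("search", "flights to Tokyo"),
    ("search", "hotels in Shinjuku"),
    ("calculator", "daily budget"),
]))
print(results)
```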
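The reward framework, in turn, can be pictured as a dispatch over per-task strategies. The registry and function names below are assumptions for illustration; the model-judged and tool-verified strategies are stubbed rather than guessed at:

```python
from typing import Callable

def rule_based_reward(answer: str, gold: str) -> float:
    """Tasks with clear success criteria: e.g., case-insensitive exact match."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def model_judge_reward(answer: str, gold: str) -> float:
    """Open-ended tasks: a judge model would score the answer."""
    raise NotImplementedError("query a judge LLM and map its verdict to [0, 1]")

def tool_verified_reward(answer: str, gold: str) -> float:
    """Tasks needing external validation, e.g., executing generated code."""
    raise NotImplementedError("run the external check and return a score")

# Hypothetical mapping from task type to reward strategy.
REWARDS: dict[str, Callable[[str, str], float]] = {
    "qa": rule_based_reward,
    "writing": model_judge_reward,
    "coding": tool_verified_reward,
}

def compute_reward(task_type: str, answer: str, gold: str) -> float:
    return REWARDS[task_type](answer, gold)

print(compute_reward("qa", "Paris", "paris"))  # -> 1.0
```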
A crucial aspect of RLFactory is how it redefines the Markov Decision Process (MDP) state space. It introduces “observation markers” derived from tool feedback. These markers are dynamically appended to the interaction sequence, creating a continuous feedback loop between the model, the tool, and the environment. This enables dynamic policy optimization through a “generate-parse-invoke-update” workflow, ensuring the model continuously learns and adapts.
How RLFactory Works
The framework is built on a layered modular architecture, designed for a low-barrier, “plug-and-play” development experience. It integrates efficient asynchronous interactions, flexible reward strategies, and a decoupled tool environment, allowing models to autonomously learn optimal strategies for multi-round tool invocation.
RLFactory supports a broad definition of “tools,” categorizing them into three forms (a sketch of a shared tool interface follows the list):
- Program Tools: These are standard programs like search interfaces, code interpreters, or calculators that extend the model’s computational and information-gathering abilities through direct input-output mapping.
- Model Tools: These integrate third-party models (open-source or closed-source) to supplement the LLM’s capabilities, such as using GPT-4o for summarization or Stable Diffusion for image generation.
- Agent Tools: These are complex systems that combine program and model modules to automate end-to-end tasks, like a “literature research agent” that uses search, summarization, and citation tools to produce a research report.
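One way to picture the decoupled, plug-and-play design is a single interface that all three tool forms implement, so the training loop depends only on that abstraction. The class and method names below are illustrative assumptions, not RLFactory’s actual API:

```python
from abc import ABC, abstractmethod

class Tool(ABC):
    """Common interface: the trainer sees only this abstraction."""
    name: str

    @abstractmethod
    async def run(self, arguments: str) -> str: ...

class CalculatorTool(Tool):
    """Program tool: direct input-output mapping."""
    name = "calculator"
    async def run(self, arguments: str) -> str:
        return str(eval(arguments, {"__builtins__": {}}))  # toy evaluator, demo only

class SummarizerTool(Tool):
    """Model tool: wraps a third-party model (e.g., GPT-4o), stubbed here."""
    name = "summarizer"
    async def run(self, arguments: str) -> str:
        raise NotImplementedError("call the third-party model here")

class ResearchAgentTool(Tool):
    """Agent tool: composes program and model tools end to end."""
    name = "research_agent"
    def __init__(self, search: Tool, summarizer: Tool):
        self.search, self.summarizer = search, summarizer
    async def run(self, arguments: str) -> str:
        docs = await self.search.run(arguments)
        return await self.summarizer.run(docs)
```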
The multi-turn interaction flow follows a “Generate – Parse – Invoke – Update” cycle. The model generates a response, which is parsed for tool invocation instructions. Tools are invoked asynchronously, and their results are formatted and fed back to the model as observation markers, allowing continuous strategy adjustment.
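Put together, the cycle might look like the skeleton below. The `<tool>`/`<observation>` tag formats and the `parse_tool_calls` helper are assumptions for illustration; the actual markers are defined by the framework:

```python
import asyncio
import re

MAX_TURNS = 8

def parse_tool_calls(text: str) -> list[tuple[str, str]]:
    """Extract hypothetical <tool name="...">args</tool> directives."""
    return re.findall(r'<tool name="(\w+)">(.*?)</tool>', text, re.S)

async def rollout(model, tools: dict, prompt: str) -> str:
    """One episode of the generate-parse-invoke-update cycle."""
    context = prompt
    for _ in range(MAX_TURNS):
        response = model.generate(context)            # 1. generate
        context += response
        calls = parse_tool_calls(response)            # 2. parse
        if not calls:                                 # no tool call: final answer
            break
        results = await asyncio.gather(               # 3. invoke (asynchronously)
            *(tools[name].run(args) for name, args in calls)
        )
        for result in results:                        # 4. update the state with
            context += f"\n<observation>{result}</observation>\n"  # observation markers
    return context
```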
Impressive Performance and Efficiency
Experiments conducted on the Search-R1 project, using Qwen3-4B as the base model, demonstrated RLFactory’s effectiveness. The trained model achieved a test score of 0.486 on the NQ dataset, outperforming larger models like Qwen2.5-7B-Instruct-GRPO (which scored 0.473) trained with the same techniques. Beyond raw performance, RLFactory improved training throughput by 6.8x, demonstrating efficient and stable resource utilization.
In conclusion, RLFactory offers a robust, adaptable, and efficient framework for enhancing LLMs’ multi-turn tool usage in real-world scenarios. It provides a low-barrier solution for advancing LLM agents, paving the way for more powerful and efficient model-tool collaboration. You can find more details about this research in the full paper.


