
Klear-AgentForge: A New Open-Source Pipeline for Building Versatile AI Agents

TLDR: Klear-AgentForge introduces an open-source pipeline for training high-performance AI agents, starting with the Qwen3-8B model. It uses supervised fine-tuning with synthetic data and multi-turn reinforcement learning to excel in diverse tasks like tool use and coding, achieving state-of-the-art results for its size and demonstrating competitive performance against much larger models. The research highlights the effectiveness of their training methodology and explores scaling factors for model and data, as well as the challenges and benefits of different RL strategies and test-time scaling.

The world of artificial intelligence is rapidly evolving, with a growing focus on ‘agentic’ models. Unlike traditional AI that gives a single response, agentic models can act autonomously over multiple steps to achieve complex goals, much like a human problem-solver. This capability is particularly valuable in areas like coding, where tasks often require planning, execution, and self-correction through several reasoning cycles.

However, developing these sophisticated AI agents, especially open-source ones, has been challenging due to the lack of detailed post-training methodologies. This is where Klear-AgentForge steps in, presenting a comprehensive and fully open-source pipeline for training high-performance agentic models.

Introducing Klear-AgentForge

Developed by the Klear Team at Kuaishou Technology, Klear-AgentForge is a new framework designed to enhance the agentic capabilities of large language models (LLMs). The project specifically focuses on building Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. The core idea is to unlock the potential for diverse agentic tasks, including tool use and coding, through a two-stage training process.

The Training Recipe: SFT and RL

The training methodology behind Klear-AgentForge involves two key phases:

1. Supervised Fine-Tuning (SFT): This initial stage involves training the model on a massive dataset of around 2.4 billion tokens. This data is a mix of high-quality open-source datasets and specially synthesized data for agentic tool use and coding. For tool-use, multi-turn prompting with powerful LLMs is used to generate new tools, tasks, and conversations. For coding, data includes problems from code contests and software engineering tasks from GitHub repositories, with a focus on creating ‘buggy code – fix patch – tests passing’ triplets to teach the model how to correct errors.
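The 'buggy code – fix patch – tests passing' idea can be sketched as a simple data record. This is a hypothetical illustration of what one such training triplet might look like; the field names and schema are invented here, not taken from the paper:

```python
# Hypothetical sketch of one 'buggy code - fix patch - tests passing'
# SFT triplet. Field names are illustrative, not the paper's actual schema.

def make_triplet(buggy_code: str, fix_patch: str, tests: list) -> dict:
    """Bundle a bug, its fix (as a diff), and the tests verifying the fix."""
    return {
        "buggy_code": buggy_code,
        "fix_patch": fix_patch,
        "tests": tests,
    }

triplet = make_triplet(
    buggy_code="def add(a, b):\n    return a - b  # bug: wrong operator",
    fix_patch="-    return a - b\n+    return a + b",
    tests=["assert add(2, 3) == 5"],
)
```

Pairing each patch with passing tests gives the model a verifiable signal that the fix is actually correct, rather than merely plausible-looking.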

2. Reinforcement Learning (RL): Following SFT, the model undergoes multi-turn reinforcement learning. This is crucial for agentic tasks, as it allows the model to learn from interactions with various environments over longer sequences of actions. To overcome the challenge of ‘sparse rewards’ (where feedback is only available at the end of a long sequence), Klear-AgentForge uses a fine-grained reward mechanism that provides localized feedback at intermediate actions. The training also employs a ‘disaggregated architecture’ to improve efficiency, separating the process of generating model responses (rollouts) from the actual training updates.
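The fine-grained reward idea can be illustrated with a minimal sketch: instead of a single reward at the end of a long trajectory, each intermediate action receives localized feedback. The scoring rules and trajectory structure below are invented for illustration and are not the paper's actual reward function:

```python
# Illustrative sketch of fine-grained rewards over a multi-turn trajectory.
# Scoring values and the 'tool_call_valid' field are stand-ins, not the
# actual reward design from the paper.

def fine_grained_rewards(actions, final_success):
    """Assign a small localized reward per action, plus a terminal reward."""
    rewards = []
    for act in actions:
        # Reward well-formed intermediate steps, penalize malformed ones.
        rewards.append(0.1 if act.get("tool_call_valid") else -0.1)
    # Add the sparse end-of-trajectory outcome reward on top.
    rewards[-1] += 1.0 if final_success else 0.0
    return rewards

trajectory = [
    {"tool_call_valid": True},
    {"tool_call_valid": False},
    {"tool_call_valid": True},
]
print(fine_grained_rewards(trajectory, final_success=True))
```

The point of the sketch: a dense per-step signal gives the policy gradient something to work with at every turn, rather than only at the end of a long action sequence.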

Performance and Key Findings

Klear-AgentForge-8B has demonstrated impressive results across various agentic benchmarks. It significantly outperforms official post-trained Qwen3-8B models in both ‘Thinking’ and ‘Non-Thinking’ modes. The model shows strong performance in tool-use benchmarks like BFCL v3 and τ-bench (Retail and Airline domains), as well as coding benchmarks such as SWE-bench Verified and Aider Polyglot. Notably, Klear-AgentForge-8B, despite being an 8B parameter model, competes effectively with much larger models, even matching the performance of some 32B systems in coding tasks.

The research also explored several scaling factors:

  • Model Scaling: While larger models generally perform better, the 8B models showed a more significant performance gain through in-domain data fine-tuning, suggesting that smaller models can rapidly enhance their agentic capabilities.
  • Data Scaling: Increasing both the number of unique prompts and the number of trajectories per prompt led to similar improvements in model accuracy.
  • Reasoning Data: Interestingly, incorporating reasoning-focused SFT data did not directly enhance agentic capabilities; in fact, it led to performance drops, indicating that a careful design of data mix and training strategies is needed for models to excel in both reasoning and agentic tasks.

In the RL analysis, the disaggregated training framework proved to be more efficient, boosting training speed by about 32%. The study also compared multi-task RL training with a model merging strategy. While both approaches yielded performance gains, multi-task RL training requires careful monitoring to prevent training collapse and does not inherently produce a synergistic boost across tasks. Model merging, while efficient, sometimes led to slight performance drops in coding tasks, possibly due to the imbalance in training data volume.
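The model-merging strategy mentioned above can be sketched as a weighted average of parameters from task-specialized models. This is a minimal illustration of the general idea, assuming a simple uniform average; the study's actual merging scheme and weighting are not specified here, and plain Python lists stand in for parameter tensors:

```python
# Minimal sketch of model merging: averaging parameters of task-specialized
# models. Weights and structure are illustrative; real merges operate on
# full tensor state dicts and may use non-uniform or more advanced schemes.

def merge_models(state_dicts, weights=None):
    """Weighted average of per-parameter vectors (lists stand in for tensors)."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n  # default: uniform average
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged

tool_model = {"layer.w": [1.0, 2.0]}  # hypothetical tool-use specialist
code_model = {"layer.w": [3.0, 4.0]}  # hypothetical coding specialist
print(merge_models([tool_model, code_model]))  # {'layer.w': [2.0, 3.0]}
```

Averaging is cheap compared with joint multi-task RL, which is why merging is attractive, but as the article notes it can dilute the specialist that saw less training data.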

Test-time scaling, where multiple candidate solutions are generated and then selected, showed that increasing candidate diversity improves solution coverage. However, the overall performance gains are still limited by the effectiveness of the verification strategy. The ‘Agent Confidence Select’ method, which uses internal confidence estimation, showed the most stable improvements.
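The selection step in test-time scaling can be sketched as best-of-N with a confidence-based verifier, in the spirit of the ‘Agent Confidence Select’ method described above. The candidate solutions and confidence scores below are stand-ins; the actual confidence-estimation procedure is internal to the model and not reproduced here:

```python
# Hedged sketch of test-time scaling: generate N candidate solutions,
# then select the one with the highest internal confidence score.
# Candidates and scores are invented stand-ins for illustration.

def select_by_confidence(candidates):
    """candidates: list of (solution, confidence) pairs; pick the max."""
    return max(candidates, key=lambda c: c[1])[0]

candidates = [
    ("patch_a", 0.62),
    ("patch_b", 0.91),
    ("patch_c", 0.47),
]
print(select_by_confidence(candidates))  # patch_b
```

This makes the article's caveat concrete: generating more diverse candidates raises the chance a correct solution exists in the pool, but the realized gain depends entirely on whether the verifier (here, the confidence score) actually ranks that solution first.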


Looking Ahead

The Klear-AgentForge project aims to continue exploring ways to elicit multi-domain agentic abilities through post-training scaling. Future work will focus on ‘mid-training’ to smoothly transition base models to agentic ones, developing longer and broader RL training methods, and further researching small agentic LLMs. The team believes that while larger models currently show stronger performance, small language models offer a more compelling path for agentic AI due to their efficiency and suitability for high-volume use. A promising direction might be to train specialized small models for specific tasks and then use a meta-agent to intelligently combine their strengths.

For more in-depth technical details, you can read the full research paper here: Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
