
RoboGPT-R1: A Two-Stage Approach for Advanced Robot Task Planning

TLDR: RoboGPT-R1 is a new two-stage training framework for robots to better understand and execute complex, multi-step instructions. It first uses supervised learning for basic knowledge, then reinforcement learning with a unique rule-based reward function to improve reasoning, visual understanding, and action consistency. This allows smaller models like Qwen2.5-VL-3B to significantly outperform larger models on challenging long-horizon tasks, demonstrating improved planning and generalization capabilities.

Robots are becoming increasingly sophisticated, but enabling them to understand and execute complex, multi-step instructions in real-world environments remains a significant challenge. Traditional methods, often relying on supervised fine-tuning (SFT) of large language models (LLMs) and vision-language models (VLMs), struggle with tasks that require extensive reasoning, common sense, and adaptation to dynamic situations. These models tend to imitate expert demonstrations rather than developing true understanding, leading to poor generalization and a lack of physical comprehension.

A new research paper, RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning, introduces an innovative solution to these problems. Authored by Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, and Haoran Li, the paper proposes RoboGPT-R1, a two-stage fine-tuning framework designed to significantly improve embodied planning for robots.

The Two-Stage Training Approach

RoboGPT-R1 tackles the limitations of existing methods by combining two powerful learning paradigms:

The first stage involves **Supervised Fine-Tuning (SFT)**. Here, the model is trained on expert sequences, allowing it to acquire foundational knowledge and basic reasoning capabilities. This initial phase is crucial for providing the model with a strong base before more complex learning begins, ensuring stability and integrating relevant knowledge quickly.

The second stage applies **Reinforcement Learning (RL)**, specifically the Group Relative Policy Optimization (GRPO) algorithm. This stage targets the shortcomings that remain after SFT: weak visual-spatial understanding, reasoning, and generalization. Unlike SFT, which learns to reproduce predefined answers, RL lets the model explore solutions on its own, adapt to dynamic environments, and correct its errors.
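To make the "group relative" part of GRPO concrete, here is a minimal sketch of how advantages are commonly computed: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration, not details taken from the RoboGPT-R1 paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed value) avoids division by zero when every
    response in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled plans for one instruction, scored by some reward function.
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
# Plans scoring above the group mean get a positive advantage,
# plans scoring below it get a negative one.
```

Because the baseline comes from the group itself, this style of update does not need a separately trained value model.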

A Novel Reward System for Robotic Planning

A key innovation in RoboGPT-R1 is its rule-based variable reward function, meticulously designed for long-horizon embodied reasoning and planning. This reward function consists of two complementary components:

  • Format Reward: This component ensures that the robot’s output is structured, executable, and follows a logical cognitive loop (perception, reasoning, planning, action). It checks for the presence and correct typing of required fields like visual state descriptions, reasoning and reflection, language plans, and executable plans. It also penalizes invalid or fabricated actions, guiding the model to generate coherent and structured outputs.

  • Accuracy Reward (LCS-based): Crucially, for multi-step tasks, the order of actions is as important as the actions themselves. Traditional reward systems often fail to capture this. RoboGPT-R1 introduces an accuracy reward based on the Longest Common Subsequence (LCS) between the predicted and reference action sequences. This method enforces both content accuracy and sequence coherence, making it robust to minor deviations and highly effective for long, complex tasks. It allows the model to recover from early mistakes, providing a denser and more informative learning signal than strict matching or prefix-based rewards.
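The LCS-based accuracy reward can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: actions are compared as exact strings, and the LCS length is normalized by the reference length, which the article does not specify. The action strings in the example are hypothetical.

```python
def lcs_length(predicted, reference):
    """Length of the longest common subsequence of two action lists."""
    prev = [0] * (len(reference) + 1)
    for p in predicted:
        curr = [0]
        for j, r in enumerate(reference, start=1):
            if p == r:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_accuracy_reward(predicted, reference):
    """Fraction of the reference plan recovered in the correct order.

    Normalizing by len(reference) is an assumption made for this sketch.
    """
    if not reference:
        return 0.0
    return lcs_length(predicted, reference) / len(reference)


reference = ["find apple", "pick apple", "find fridge", "open fridge", "put apple"]
predicted = ["find pear", "pick apple", "find fridge", "open fridge", "put apple"]
reward = lcs_accuracy_reward(predicted, reference)  # 4 of 5 actions match in order
```

In this example the first action is wrong, so a prefix-based reward would give no credit at all, while the LCS-based reward still credits the four later actions that appear in the right order.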

The overall reward is a weighted combination of these two, with the LCS-based accuracy reward carrying a higher weight (0.8) to emphasize sequential correctness and long-horizon performance.
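As a rough illustration of the weighted combination, the sketch below pairs a simple field-presence format check with a weighted sum. Only the 0.8 accuracy weight comes from the article. The 0.2 format weight (assuming the weights sum to 1), the field names, the halving penalty for invalid actions, and the function names are all assumptions made for this example.

```python
ACCURACY_WEIGHT = 0.8                  # stated in the article
FORMAT_WEIGHT = 1.0 - ACCURACY_WEIGHT  # assumption: weights sum to 1

# Hypothetical field names and types for the structured model output.
REQUIRED_FIELDS = {
    "visual_state_description": str,
    "reasoning_and_reflection": str,
    "language_plan": str,
    "executable_plan": list,
}


def format_reward(output, valid_actions):
    """Score in [0, 1]: fraction of required fields present with the right type.

    Halving the score when the plan contains an action outside
    `valid_actions` is an assumed penalty, chosen only for illustration.
    """
    present = sum(
        isinstance(output.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )
    score = present / len(REQUIRED_FIELDS)
    plan = output.get("executable_plan")
    if isinstance(plan, list) and any(a not in valid_actions for a in plan):
        score *= 0.5
    return score


def total_reward(format_score, accuracy_score):
    """Weighted sum of the two reward components."""
    return FORMAT_WEIGHT * format_score + ACCURACY_WEIGHT * accuracy_score
```

With these weights, a perfectly formatted output whose plan is only half correct scores about 0.6, so a well-formed but wrong plan cannot earn a high reward on format alone.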

Impressive Performance and Generalization

RoboGPT-R1, built on the Qwen2.5-VL-3B model, performs strongly on the EmbodiedBench benchmark. It outperforms the larger-scale GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33%. It is also competitive with closed-source models such as GPT-4o and Gemini-2.0-flash.

For long-horizon tasks, where many models struggle, RoboGPT-R1 achieves an accuracy of 50%, a substantial improvement over previous state-of-the-art methods. The framework also shows improved generalization capabilities in unseen scenarios, indicating its ability to transfer learned skills to new environments.

Efficiency and Future Impact

Despite using a relatively small 3B-parameter model, RoboGPT-R1 delivers high performance at a low inference cost, highlighting its parameter efficiency. Ablation studies confirm that while SFT establishes initial planning competence, the subsequent RL stage, especially when combined with augmented data, is essential for closing the gap on long-horizon tasks and enhancing generalization. The LCS-based reward function is also validated as a superior approach for providing effective learning signals during reinforcement fine-tuning.

In conclusion, RoboGPT-R1 represents a significant step forward in embodied planning. By combining supervised learning with a sophisticated reinforcement learning approach and a novel, sequence-aware reward function, it enables robots to perform complex, multi-step tasks with greater reasoning, physical understanding, and adaptability, even with smaller, more efficient models.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
