RoboGPT-R1: A Two-Stage Approach for Advanced Robot Task Planning

TLDR: RoboGPT-R1 is a new two-stage training framework that helps robots understand and execute complex, multi-step instructions. It first uses supervised learning to instill basic knowledge, then applies reinforcement learning with a rule-based reward function to improve reasoning, visual understanding, and action consistency. With this training, a smaller model such as Qwen2.5-VL-3B significantly outperforms larger models on challenging long-horizon tasks and shows better planning and generalization.

Robots are becoming increasingly sophisticated, but enabling them to understand and execute complex, multi-step instructions in real-world environments remains a significant challenge. Traditional methods, often relying on supervised fine-tuning (SFT) of large language models (LLMs) and vision-language models (VLMs), struggle with tasks that require extensive reasoning, common sense, and adaptation to dynamic situations. These models tend to imitate expert demonstrations rather than developing true understanding, leading to poor generalization and a lack of physical comprehension.

A new research paper, RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning, introduces an innovative solution to these problems. Authored by Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, and Haoran Li, the paper proposes RoboGPT-R1, a two-stage fine-tuning framework designed to significantly improve embodied planning for robots.

The Two-Stage Training Approach

RoboGPT-R1 tackles the limitations of existing methods by combining two powerful learning paradigms:

The first stage involves **Supervised Fine-Tuning (SFT)**. Here, the model is trained on expert sequences, allowing it to acquire foundational knowledge and basic reasoning capabilities. This initial phase is crucial for providing the model with a strong base before more complex learning begins, ensuring stability and integrating relevant knowledge quickly.

The second stage applies **Reinforcement Learning (RL)**, specifically the Group Relative Policy Optimization (GRPO) algorithm. This stage targets the model’s shortcomings in visual-spatial understanding, reasoning, and generalization. Unlike SFT, which trains the model to reproduce predefined answers, RL lets the model explore solutions on its own, adapt to dynamic environments, and correct its errors.
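To make the "group relative" part of GRPO concrete, here is a minimal sketch of how advantages are commonly computed in that algorithm: several candidate plans are sampled for the same prompt, each is scored by the reward function, and each score is normalized against the mean and standard deviation of its own group. The function name and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per candidate plan sampled for the
    same prompt. `eps` (an assumed small constant) avoids division by zero
    when every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical plans sampled for one instruction, scored in [0, 1].
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
print(advantages)
```

Plans that score above the group average get a positive advantage and are reinforced, while below-average plans are pushed down. No separate value network is needed, which is one reason this family of methods is attractive for fine-tuning small models.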

A Novel Reward System for Robotic Planning

A key innovation in RoboGPT-R1 is its rule-based variable reward function, meticulously designed for long-horizon embodied reasoning and planning. This reward function consists of two complementary components:

**Format Reward:** This component ensures that the robot’s output is structured, executable, and follows a logical cognitive loop (perception, reasoning, planning, action). It checks for the presence and correct typing of required fields like visual state descriptions, reasoning and reflection, language plans, and executable plans. It also penalizes invalid or fabricated actions, guiding the model to generate coherent and structured outputs.

**Accuracy Reward (LCS-based):** Crucially, for multi-step tasks, the order of actions is as important as the actions themselves. Traditional reward systems often fail to capture this. RoboGPT-R1 introduces an accuracy reward based on the Longest Common Subsequence (LCS) between the predicted and reference action sequences. This method enforces both content accuracy and sequence coherence, making it robust to minor deviations and highly effective for long, complex tasks. It allows the model to recover from early mistakes, providing a denser and more informative learning signal than strict matching or prefix-based rewards.
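The difference between an LCS-based score and a prefix-based one is easiest to see on a small example. The sketch below computes the LCS length with standard dynamic programming and divides it by the reference length; that normalization, the function names, and the action strings are assumptions made for illustration, and the paper's exact formula may differ.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], ref: Sequence[str]) -> int:
    """Length of the longest common subsequence of two action lists."""
    prev = [0] * (len(ref) + 1)
    for p in pred:
        curr = [0]
        for j, r in enumerate(ref, start=1):
            if p == r:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_accuracy_reward(pred: Sequence[str], ref: Sequence[str]) -> float:
    """LCS length divided by reference length (assumed normalization)."""
    if not ref:
        return 0.0
    return lcs_length(pred, ref) / len(ref)


def prefix_reward(pred: Sequence[str], ref: Sequence[str]) -> float:
    """Baseline for comparison: fraction of the reference matched before the first mismatch."""
    if not ref:
        return 0.0
    matched = 0
    for p, r in zip(pred, ref):
        if p != r:
            break
        matched += 1
    return matched / len(ref)


# Hypothetical plan where the first two steps are swapped.
reference = ["find apple", "pick apple", "open fridge", "put apple"]
predicted = ["pick apple", "find apple", "open fridge", "put apple"]

print(lcs_accuracy_reward(predicted, reference))  # 0.75
print(prefix_reward(predicted, reference))        # 0.0
```

A single early mistake zeroes out the prefix score, while the LCS score still credits the three correctly ordered steps that follow. That is the denser learning signal described above.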

The overall reward is a weighted combination of these two, with the LCS-based accuracy reward carrying a higher weight (0.8) to emphasize sequential correctness and long-horizon performance.
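As a rough sketch of how the two components might be combined, the code below scores the format as the fraction of required fields that are present with the expected type, then mixes it with an accuracy score using the 0.8 weight mentioned above. The field names, the fractional format score, and the complementary 0.2 format weight are all assumptions for illustration. The penalty for invalid or fabricated actions is omitted for brevity.

```python
from typing import Any, Mapping

# Hypothetical field names and types, modeled on the fields the article lists.
REQUIRED_FIELDS = {
    "visual_state_description": str,
    "reasoning_and_reflection": str,
    "language_plan": str,
    "executable_plan": list,
}


def format_reward(output: Mapping[str, Any]) -> float:
    """Fraction of required fields present with the expected type (assumed scoring)."""
    ok = sum(
        1 for name, expected in REQUIRED_FIELDS.items()
        if isinstance(output.get(name), expected)
    )
    return ok / len(REQUIRED_FIELDS)


def total_reward(output: Mapping[str, Any], accuracy: float, w_acc: float = 0.8) -> float:
    """Weighted sum; the format weight is assumed to be 1 - w_acc."""
    return (1.0 - w_acc) * format_reward(output) + w_acc * accuracy


# A well-formed hypothetical model output with an accuracy score of 0.75.
output = {
    "visual_state_description": "An apple is on the counter; the fridge is closed.",
    "reasoning_and_reflection": "The apple must be picked up before the fridge is opened.",
    "language_plan": "Find the apple, pick it up, open the fridge, put the apple inside.",
    "executable_plan": ["find apple", "pick apple", "open fridge", "put apple"],
}
print(total_reward(output, accuracy=0.75))
```

With these weights, a perfectly formatted answer with a poor action sequence still earns little, which matches the stated emphasis on sequential correctness.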

Impressive Performance and Generalization

RoboGPT-R1, trained on the Qwen2.5-VL-3B model, demonstrates remarkable performance on the EmbodiedBench benchmark. It significantly outperforms larger-scale models like GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33%. Even more impressively, it achieves competitive results with closed-source models such as GPT-4o and Gemini-2.0-flash.

For long-horizon tasks, where many models struggle, RoboGPT-R1 achieves an accuracy of 50%, a substantial improvement over previous state-of-the-art methods. The framework also shows improved generalization capabilities in unseen scenarios, indicating its ability to transfer learned skills to new environments.

Efficiency and Future Impact

Despite using a relatively small 3B-parameter model, RoboGPT-R1 delivers high performance at a low inference cost, highlighting its parameter efficiency. Ablation studies confirm that while SFT establishes initial planning competence, the subsequent RL stage, especially when combined with augmented data, is essential for closing the gap on long-horizon tasks and enhancing generalization. The LCS-based reward function is also validated as a superior approach for providing effective learning signals during reinforcement fine-tuning.

In conclusion, RoboGPT-R1 represents a significant step forward in embodied planning. By combining supervised learning with a sophisticated reinforcement learning approach and a novel, sequence-aware reward function, it enables robots to perform complex, multi-step tasks with greater reasoning, physical understanding, and adaptability, even with smaller, more efficient models.
