TLDR: ERA is a two-stage framework that enables smaller vision language models (VLMs) to become highly capable embodied agents. It first injects foundational knowledge from diverse data sources (augmented trajectories, environment-specific data, and external knowledge) through Embodied Prior Learning (EPL). It then refines these skills using online reinforcement learning (RL) with efficient context management via self-summarization, dense reward shaping, and turn-level policy optimization. This approach allows a compact 3B model to outperform larger models like GPT-4o on complex high-level planning and low-level manipulation tasks, demonstrating strong generalization to unseen scenarios.
Recent advancements in artificial intelligence have brought us closer to creating robots that can understand and interact with the world around them. These ‘embodied agents’ use Vision Language Models (VLMs) to perceive, reason, and act in complex environments. However, a significant challenge remains: the most capable VLM systems are often massive and expensive to deploy, while smaller, more efficient models typically lack the necessary knowledge and skills to perform well.
A new research paper introduces a framework called Embodied Reasoning Agent (ERA), designed to bridge this gap. ERA is a two-stage approach that transforms smaller VLMs into highly capable embodied agents by combining foundational knowledge learning with online reinforcement learning.
Stage 1: Embodied Prior Learning (EPL)
The first stage, Embodied Prior Learning, focuses on equipping smaller VLMs with essential knowledge before they even begin interacting with an environment. This is crucial because general VLMs, especially compact ones, often lack the specific understanding needed for embodied tasks. ERA distills this foundational knowledge from three distinct types of data:
- Trajectory-Augmented Priors: Existing robot trajectory data (sequences of observations and actions) are enriched with structured reasoning generated by more powerful models. This includes detailed visual descriptions, reflections on past actions to detect errors, and step-level plans. For low-level manipulation tasks, rule-based methods are used to generate accurate visual descriptions, ensuring consistency between perception and control.
- Environment-Anchored Priors: This provides in-environment knowledge beyond just trajectories. For high-level planning tasks, this includes ‘masked action modeling’ (predicting missing actions in a sequence) and ‘action sequence reordering’ (correctly ordering shuffled actions). For low-level control, it involves ‘absolute coordinate grounding’ (mapping objects to 3D coordinates), ‘relative coordinate grounding’ (understanding spatial relations like ‘leftmost’), and ‘combined grounding’ for joint reasoning.
- External Knowledge Priors: To transfer general reasoning and cross-domain understanding, ERA leverages large-scale datasets from outside the embodied environment. For high-level planning, this includes datasets designed to activate chain-of-thought reasoning. For low-level control, it uses multimodal spatial reasoning datasets to improve visual perception and spatial understanding.
By combining these diverse data sources, EPL provides a robust foundation, ensuring the VLM has a strong grasp of perception, reasoning, and environmental understanding.
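To make the environment-anchored tasks more concrete, here is a minimal Python sketch of how such training samples might be built from an existing plan and a set of object coordinates. It is an illustration only, not the paper's data pipeline: the function names, the `[MASK]` token, the sample dictionary layout, the example plan, and the convention that a smaller x coordinate means "further left" are all assumptions made for this example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper

def masked_action_sample(actions, rng):
    """Masked action modeling: hide one action; the target is the hidden action."""
    idx = rng.randrange(len(actions))
    masked = list(actions)
    masked[idx] = MASK
    return {"input": masked, "target": actions[idx], "index": idx}

def reordering_sample(actions, rng):
    """Action sequence reordering: shuffle the plan; the target is the original order."""
    shuffled = list(actions)
    rng.shuffle(shuffled)
    return {"input": shuffled, "target": list(actions)}

def leftmost_sample(objects):
    """Relative coordinate grounding: ask which object is leftmost.

    Assumes (x, y, z) coordinates where a smaller x means further left.
    """
    name = min(objects, key=lambda n: objects[n][0])
    return {"question": "Which object is leftmost?", "answer": name}

rng = random.Random(0)  # fixed seed so the samples are reproducible
plan = ["find mug", "pick up mug", "find sink", "put mug in sink"]

mam = masked_action_sample(plan, rng)
reorder = reordering_sample(plan, rng)
grounding = leftmost_sample({
    "red cube": (0.42, 0.10, 0.05),
    "blue star": (0.18, 0.31, 0.05),
})

print(mam)
print(reorder)
print(grounding)
```

Each sample pairs an input the model sees with a target it must produce, so the same supervised fine-tuning setup could in principle consume all three task types.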
Stage 2: Online Reinforcement Learning (RL)
After acquiring foundational skills through EPL, the second stage refines the agent’s performance using online reinforcement learning. This allows the agent to interact with the environment, learn from trial and error, and adapt its policies. To overcome common challenges in robot RL, such as long task horizons, sparse rewards, and training instability, ERA introduces three key designs:
- Self-Summarization for Context Management: In long-horizon tasks, the history of interactions can become very long, leading to computational inefficiency and distraction. ERA addresses this by training the model to explicitly summarize its interaction history into a concise reflection at each step. This ‘self-summarization’ mechanism reduces the context size, allowing the agent to focus on relevant information without being overwhelmed by lengthy histories.
- Dense Reward Shaping: Traditional embodied tasks often provide rewards only upon successful completion, making it hard for the agent to learn from intermediate steps. ERA introduces a process-level reward function that integrates task completion, intermediate progress (subgoals), and behavior shaping (rewarding desirable actions and penalizing undesirable ones). This dense feedback guides exploration and stabilizes learning, especially in complex, multi-step tasks.
- Turn-Level Policy Optimization: Unlike token-level optimization, which can be unstable for multi-turn interactions, ERA treats the agent’s entire response in a turn as a single ‘action’. This ‘turn-level’ approach estimates the value of an entire interaction turn, ensuring that credit assignment aligns with how the agent interacts with the environment. This leads to more stable and effective policy learning.
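The dense reward and turn-level credit assignment described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: the `Turn` fields, the reward weights, and the use of a GAE-style advantage estimator are choices made for this example, since the summary here does not specify the exact reward terms or estimator.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    success: bool         # task completed on this turn
    new_subgoals: int     # subgoals newly completed on this turn
    invalid_action: bool  # e.g. the environment rejected the action

def shaped_reward(turn, w_success=1.0, w_subgoal=0.2, w_invalid=0.1):
    """Combine completion, progress, and behavior terms (weights are illustrative)."""
    reward = w_success * float(turn.success)
    reward += w_subgoal * turn.new_subgoals
    reward -= w_invalid * float(turn.invalid_action)
    return reward

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE-style advantages with one reward and one value estimate per turn.

    Every token in a turn's response would share that turn's advantage,
    instead of each token receiving its own estimate.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Value after the final turn is treated as zero (episode ends).
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

episode = [
    Turn(success=False, new_subgoals=1, invalid_action=False),
    Turn(success=False, new_subgoals=0, invalid_action=True),
    Turn(success=True, new_subgoals=1, invalid_action=False),
]
rewards = [shaped_reward(t) for t in episode]
values = [0.0, 0.0, 0.0]  # placeholder critic outputs, one per turn
advantages = turn_level_gae(rewards, values, gamma=1.0, lam=1.0)

print(rewards)
print(advantages)
```

With `gamma` and `lam` set to 1 and a zero critic, each turn's advantage reduces to the sum of shaped rewards from that turn onward, which makes it easy to see how an early subgoal still receives credit for the eventual success.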
Impressive Results with a Compact Model
The researchers evaluated ERA on EmbodiedBench, a comprehensive benchmark covering both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks. ERA-3B, a compact 3-billion-parameter model, achieved state-of-the-art performance among training-based agents. It also surpassed larger, prompting-based models such as GPT-4o, by 8.4% on EB-ALFRED and 19.4% on EB-Manipulation.
Furthermore, ERA demonstrated strong generalization capabilities, performing significantly better on unseen tasks compared to previous RL baselines. This indicates that ERA learns robust and transferable skills rather than simply overfitting to training data. The study also highlighted the complementary roles of EPL and RL, with EPL providing essential foundational knowledge and online RL effectively refining the policy for better generalization.
The ERA framework, detailed in the research paper, offers a practical and scalable pathway toward developing more powerful and efficient VLM-based agents for real-world applications, providing valuable insights for future embodied AI systems. The work was supported in part by NSF and ORN, and utilized advanced computing systems like Delta and DeltaAI.


