
AI Agents Learn to Plan Complex Tasks from Egocentric Videos

TLDR: Researchers introduce a novel AI model called TAMFormer that enables agents to plan high-level actions from egocentric instructional videos. It uses a “Topological Affordance Memory” to store environmental affordances and goal associations, allowing the agent to understand possible actions and recover from deviations, leading to more robust and effective task planning in interactive environments.

Teaching artificial intelligence (AI) agents to perform complex activities by observing human demonstrations is a significant challenge. This is especially true when the observations come from an egocentric perspective, meaning the AI sees the world through the eyes of the demonstrator, much like a first-person video. This kind of learning is vital for developing advanced applications such as augmented reality assistants that guide users through tasks or sophisticated human-robot collaboration systems.

Traditional methods for task planning in AI often assume a complete understanding of the environment, where all possible states and actions are predefined. However, this approach falls short when dealing with dynamic, real-world visual inputs, where information is often incomplete and situations can change unexpectedly. Existing visual planning methods also frequently lack immediate feedback, making it difficult for an AI to correct mistakes or adapt if its planned actions don’t go as expected.

Introducing Interactive Action Planning

To address these limitations, researchers have proposed a new task called “Interactive Action Planning.” This task emphasizes several key elements: the ability to learn skills offline from recorded experiences without needing direct interaction during training, evaluating performance in an interactive environment, comprehending the given task’s description, and planning a sequence of actions while being fully aware of the surrounding environment.

The core idea is to enable AI agents to reason about their objectives, determine possible interactions in the current situation, and decide on appropriate actions. Just as humans draw upon past experiences to solve new problems, an AI agent needs to extract useful information from its history when planning in a new environment.

Topological Affordance Memory (TAM)

Inspired by how the human hippocampus plays a crucial role in memory by associating past experiences, the researchers introduce a novel memory structure called Topological Affordance Memory (TAM). This memory stores an expert’s successful past experiences in achieving specific goals. TAM consists of three main components:

  • Localization: This network identifies the most similar past situation in the memory based on the agent’s current visual observation.
  • Affordance Learning: This module learns what actions are possible or “afforded” by the current environment. For example, it learns that you can “grab a plate” if a plate is visible and within reach.
  • Goal Association: This function helps verify whether the agent’s current progress aligns with the overall task goal, especially when there might be multiple ways to achieve it.

These components work together to organize expert experiences into a structured representation, localize current observations to retrieve relevant memories, and enable replanning if the agent deviates from its intended path.
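To make the three components concrete, here is a minimal sketch of a TAM-like structure in Python. The class name, fields, and cosine-similarity localization are illustrative assumptions, not the paper's actual implementation (which uses learned networks for localization and affordance prediction):

```python
import numpy as np

class TopologicalAffordanceMemory:
    """Illustrative sketch: a graph of expert-experience nodes, each storing
    a visual feature, the actions afforded there, and a goal-association score."""

    def __init__(self):
        self.features = []      # visual embedding of each stored expert observation
        self.affordances = []   # set of actions afforded at each node
        self.goal_scores = []   # association score linking each node to the task goal
        self.edges = {}         # node index -> successor node indices (topology)

    def add_node(self, feature, afforded_actions, goal_score):
        idx = len(self.features)
        self.features.append(np.asarray(feature, dtype=float))
        self.affordances.append(set(afforded_actions))
        self.goal_scores.append(goal_score)
        self.edges[idx] = []
        return idx

    def connect(self, src, dst):
        # Record that dst was reached from src in the expert trajectory.
        self.edges[src].append(dst)

    def localize(self, observation):
        """Return the node whose stored feature is most similar (cosine
        similarity) to the current observation embedding."""
        obs = np.asarray(observation, dtype=float)
        sims = [
            f @ obs / (np.linalg.norm(f) * np.linalg.norm(obs) + 1e-8)
            for f in self.features
        ]
        return int(np.argmax(sims))
```

In this toy form, localization is a nearest-neighbor lookup over stored features; in the actual model it is a learned network, but the retrieval role is the same.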

Smart Action Generation and Replanning

To generate coherent action sequences, the system uses an auto-regressive sequence model, specifically a Transformer decoder, which predicts actions based on the task goal, the history of actions taken, and the retrieved memories from TAM. This allows the agent to integrate past experiences and the task goal into its decision-making process.
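The decoding loop can be sketched as a standard greedy auto-regressive generation procedure. The `decoder` callable below is a placeholder assumption standing in for the Transformer decoder, which in the paper conditions on the goal, the retrieved memories, and the action history:

```python
def plan_actions(decoder, goal_emb, memory_emb, max_steps, end_token):
    """Greedy auto-regressive decoding sketch: at each step the decoder
    conditions on the task goal, the retrieved TAM memories, and the
    actions taken so far, then emits the most likely next action."""
    history = []
    for _ in range(max_steps):
        logits = decoder(goal_emb, memory_emb, history)
        action = max(range(len(logits)), key=logits.__getitem__)
        if action == end_token:
            break  # the model signals the plan is complete
        history.append(action)
    return history
```

Greedy selection is used here for simplicity; beam search or sampling would slot into the same loop.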

A critical feature of this approach is its replanning algorithm. Action execution can often deviate from the planned sequence. To tackle this, the system can “auto-correct” by optimizing the goal association score and adjusting the localized memory node. This ensures that even if unexpected events occur or actions are misexecuted, the agent can still retrieve good memories and find an alternative valid plan, preventing catastrophic failures.
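The auto-correct idea can be illustrated with a small sketch. It assumes a memory object exposing `localize`, `goal_scores`, and `edges` (as in the toy structure above); the threshold-and-neighbor heuristic is an illustrative simplification of the paper's score optimization:

```python
def replan(tam, observation, goal_threshold=0.5):
    """Sketch of replanning: re-localize the current observation in memory;
    if the matched node's goal-association score is too low (the agent has
    drifted off the expert path), jump to the successor node whose score
    best aligns with the goal before continuing the plan."""
    node = tam.localize(observation)
    if tam.goal_scores[node] < goal_threshold:
        candidates = tam.edges.get(node, [])
        if candidates:
            node = max(candidates, key=lambda n: tam.goal_scores[n])
    return node
```

The key point is that deviation detection and recovery both run through the memory: a low goal-association score triggers a search for a better-aligned node rather than blind continuation of the original plan.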


Experimental Validation

The proposed method, named TAMFormer, was evaluated in the VirtualHome environment, a realistic interactive 3D simulation that allows for complex interactions with objects. The evaluation included various scenarios:

  • Pure-text: Planning based solely on language descriptions.
  • Visual static: Planning with visual cues but without real-time environmental feedback.
  • Visual interactive: Planning with real-time feedback from the interactive environment.
  • Visual interactive attack: The most stringent test, where predicted actions are randomly permuted to assess the model’s robustness to deviations.

The results demonstrated that TAMFormer significantly outperformed baseline models, showing a notable improvement in its ability to execute actions correctly and achieve goals, especially in interactive and attack scenarios. This highlights the effectiveness of the memory structure and the replanning algorithm in handling dynamic and uncertain environments. Ablation studies further confirmed the importance of each component of the TAMFormer model, particularly the replanning module and the learned localization network.

While promising, the current method has a limitation: it assumes that the expert demonstrations it learns from are flawless. In real-world instructional videos, human actions can be imperfect or varied. Future work could explore incorporating “bad” memories to help the AI avoid undesirable actions. For more technical details, you can refer to the full research paper: What to Do Next? Memorizing skills from Egocentric Instructional Video.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
