
AI Agents Learn to Plan Complex Tasks from Egocentric Videos

TLDR: Researchers introduce a novel AI model called TAMFormer that enables agents to plan high-level actions from egocentric instructional videos. It uses a “Topological Affordance Memory” to store environmental affordances and goal associations, allowing the agent to understand possible actions and recover from deviations, leading to more robust and effective task planning in interactive environments.

Teaching artificial intelligence (AI) agents to perform complex activities by observing human demonstrations is a significant challenge. This is especially true when the observations come from an egocentric perspective, meaning the AI sees the world through the eyes of the demonstrator, much like a first-person video. This kind of learning is vital for developing advanced applications such as augmented reality assistants that guide users through tasks or sophisticated human-robot collaboration systems.

Traditional methods for task planning in AI often assume a complete understanding of the environment, where all possible states and actions are predefined. However, this approach falls short when dealing with dynamic, real-world visual inputs, where information is often incomplete and situations can change unexpectedly. Existing visual planning methods also frequently lack immediate feedback, making it difficult for an AI to correct mistakes or adapt if its planned actions don’t go as expected.

Introducing Interactive Action Planning

To address these limitations, researchers have proposed a new task called “Interactive Action Planning.” This task emphasizes several key elements: the ability to learn skills offline from recorded experiences without needing direct interaction during training, evaluating performance in an interactive environment, comprehending the given task’s description, and planning a sequence of actions while being fully aware of the surrounding environment.

The core idea is to enable AI agents to reason about their objectives, determine possible interactions in the current situation, and decide on appropriate actions. Just as humans draw upon past experiences to solve new problems, an AI agent needs to extract useful information from its history when planning in a new environment.

Topological Affordance Memory (TAM)

Inspired by how the human hippocampus plays a crucial role in memory by associating past experiences, the researchers introduce a novel memory structure called Topological Affordance Memory (TAM). This memory stores an expert’s successful past experiences in achieving specific goals. TAM consists of three main components:

  • Localization: This network identifies the most similar past situation in the memory based on the agent’s current visual observation.
  • Affordance Learning: This module learns what actions are possible or “afforded” by the current environment. For example, it learns that you can “grab a plate” if a plate is visible and within reach.
  • Goal Association: This function helps verify whether the agent’s current progress aligns with the overall task goal, especially when there might be multiple ways to achieve it.

These components work together to organize expert experiences into a structured representation, localize current observations to retrieve relevant memories, and enable replanning if the agent deviates from its intended path.
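To make the three components concrete, here is a minimal sketch of a TAM-like structure in Python. The class name, fields, and cosine-similarity localization are illustrative assumptions, not the paper's actual implementation (which uses learned networks for localization and affordance prediction):

```python
import numpy as np

class TopologicalAffordanceMemory:
    """Illustrative sketch: a graph of expert-experience nodes, each storing
    a visual feature, the actions afforded there, and a goal-association score."""

    def __init__(self):
        self.features = []      # visual embedding of each stored expert observation
        self.affordances = []   # set of actions afforded at each node
        self.goal_scores = []   # association score linking each node to the task goal
        self.edges = {}         # node index -> successor node indices (topology)

    def add_node(self, feature, afforded_actions, goal_score):
        idx = len(self.features)
        self.features.append(np.asarray(feature, dtype=float))
        self.affordances.append(set(afforded_actions))
        self.goal_scores.append(goal_score)
        self.edges[idx] = []
        return idx

    def connect(self, src, dst):
        # Record that dst was reached from src in the expert trajectory.
        self.edges[src].append(dst)

    def localize(self, observation):
        """Return the node whose stored feature is most similar (cosine
        similarity) to the current observation embedding."""
        obs = np.asarray(observation, dtype=float)
        sims = [
            f @ obs / (np.linalg.norm(f) * np.linalg.norm(obs) + 1e-8)
            for f in self.features
        ]
        return int(np.argmax(sims))
```

In this toy form, localization is a nearest-neighbor lookup over stored features; in the actual model it is a learned network, but the retrieval role is the same.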

Smart Action Generation and Replanning

To generate coherent action sequences, the system uses an auto-regressive sequence model, specifically a Transformer decoder, which predicts actions based on the task goal, the history of actions taken, and the retrieved memories from TAM. This allows the agent to integrate past experiences and the task goal into its decision-making process.
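The decoding loop can be sketched as a standard greedy auto-regressive generation procedure. The `decoder` callable below is a placeholder assumption standing in for the Transformer decoder, which in the paper conditions on the goal, the retrieved memories, and the action history:

```python
def plan_actions(decoder, goal_emb, memory_emb, max_steps, end_token):
    """Greedy auto-regressive decoding sketch: at each step the decoder
    conditions on the task goal, the retrieved TAM memories, and the
    actions taken so far, then emits the most likely next action."""
    history = []
    for _ in range(max_steps):
        logits = decoder(goal_emb, memory_emb, history)
        action = max(range(len(logits)), key=logits.__getitem__)
        if action == end_token:
            break  # the model signals the plan is complete
        history.append(action)
    return history
```

Greedy selection is used here for simplicity; beam search or sampling would slot into the same loop.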

A critical feature of this approach is its replanning algorithm. Action execution can often deviate from the planned sequence. To tackle this, the system can “auto-correct” by optimizing the goal association score and adjusting the localized memory node. This ensures that even if unexpected events occur or actions are misexecuted, the agent can still retrieve good memories and find an alternative valid plan, preventing catastrophic failures.
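The auto-correct idea can be illustrated with a small sketch. It assumes a memory object exposing `localize`, `goal_scores`, and `edges` (as in the toy structure above); the threshold-and-neighbor heuristic is an illustrative simplification of the paper's score optimization:

```python
def replan(tam, observation, goal_threshold=0.5):
    """Sketch of replanning: re-localize the current observation in memory;
    if the matched node's goal-association score is too low (the agent has
    drifted off the expert path), jump to the successor node whose score
    best aligns with the goal before continuing the plan."""
    node = tam.localize(observation)
    if tam.goal_scores[node] < goal_threshold:
        candidates = tam.edges.get(node, [])
        if candidates:
            node = max(candidates, key=lambda n: tam.goal_scores[n])
    return node
```

The key point is that deviation detection and recovery both run through the memory: a low goal-association score triggers a search for a better-aligned node rather than blind continuation of the original plan.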


Experimental Validation

The proposed method, named TAMFormer, was evaluated in the VirtualHome environment, a realistic interactive 3D simulation that allows for complex interactions with objects. The evaluation included various scenarios:

  • Pure-text: Planning based solely on language descriptions.
  • Visual static: Planning with visual cues but without real-time environmental feedback.
  • Visual interactive: Planning with real-time feedback from the interactive environment.
  • Visual interactive attack: The most stringent test, where predicted actions are randomly permuted to assess the model’s robustness to deviations.

The results demonstrated that TAMFormer significantly outperformed baseline models, showing a notable improvement in its ability to execute actions correctly and achieve goals, especially in interactive and attack scenarios. This highlights the effectiveness of the memory structure and the replanning algorithm in handling dynamic and uncertain environments. Ablation studies further confirmed the importance of each component of the TAMFormer model, particularly the replanning module and the learned localization network.

While promising, the current method has a limitation: it assumes that the expert demonstrations it learns from are flawless. In real-world instructional videos, human actions can be imperfect or varied. Future work could explore incorporating “bad” memories to help the AI avoid undesirable actions. For more technical details, you can refer to the full research paper: What to Do Next? Memorizing skills from Egocentric Instructional Video.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
