Enhancing Robotic Action Recognition with Knowledge-Infused AI

TLDR: KRAST is a new method that improves robotic action recognition from videos by using vision-language models (VLMs) augmented with structured textual knowledge. It employs a prompt-learning framework where detailed, learnable text descriptions guide the VLM, achieving over 95% accuracy on the ETRI-Activity3D dataset using only RGB video and outperforming previous state-of-the-art methods.

Autonomous robots need to understand human actions accurately to operate safely and effectively in our homes and other complex environments. Imagine a robot assisting an elderly person; it needs to correctly identify actions like “washing hands” versus “washing dishes” to provide appropriate support. This task, known as vision-based action recognition, has traditionally been challenging due to cluttered environments, occlusions, and the subtle differences between similar actions.

While deep learning has significantly improved how computers understand video, most existing models rely solely on visual input. This can lead to errors when actions are visually similar or partially hidden. The recent rise of vision-language models (VLMs), which combine visual and textual understanding, offers a promising path forward by allowing knowledge to be transferred across different types of data.

A new research paper, KRAST: Knowledge-Augmented Robotic Action Recognition with Structured Text for Vision-Language Models, introduces an innovative approach to enhance action recognition for robots. The core idea behind KRAST is to leverage the power of pre-trained vision-language models by enriching them with specific, structured knowledge about actions. Instead of extensively retraining the entire VLM, KRAST adapts a prompt-learning framework.

In this framework, textual descriptions of each action are transformed into “learnable prompts” that guide the VLM. Think of these prompts as intelligent hints that help the model focus on the most relevant aspects of an action. The researchers explored several ways to structure and encode these textual descriptions to maximize their effectiveness.

How KRAST Works: Knowledge-Augmented Prompts

KRAST’s methodology centers on using knowledge-aware prompts. These prompts come in two main forms: continuous and discrete. Continuous prompts capture broad contextual information, while discrete prompts are derived from concise textual summaries of action descriptions. These descriptions were initially generated using a large language model like ChatGPT and then refined manually to ensure clarity and relevance.

For discrete prompts, the team developed several strategies, most notably Segmented Knowledge Prompt Tuning (SegKPT), which breaks full action descriptions into meaningful segments based on different types of knowledge:

  • Hierarchical Strategy: Actions are grouped into broader categories (e.g., “food consumption”) and then finer sub-groups (e.g., “eating activity”). Prompts are generated using both levels of categorization to capture overall meaning and specific details.
  • Semantic Strategy: Concise descriptions are created to summarize the core concept and characteristics of each action, providing an interpretable guide for the VLM.
  • Discriminative Strategy: This is crucial for distinguishing between very similar actions, like “washing hands” and “washing a towel by hand.” Prompts highlight subtle differences, such as the object being manipulated or the specific motion pattern, helping the model to separate closely related classes.
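The three segment types above can be sketched as a small data structure. This is only an illustration of the idea, not the paper's implementation: the `ActionKnowledge` class, the `build_segments` and `compose_prompt` helpers, the sentence templates, and the example group names and descriptions for "washing hands" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionKnowledge:
    """Hypothetical container for the knowledge attached to one action class."""
    name: str
    coarse_group: str    # broader category (hierarchical knowledge)
    fine_group: str      # finer sub-group (hierarchical knowledge)
    semantic: str        # concise summary of the action
    discriminative: str  # what separates it from similar actions


def build_segments(k: ActionKnowledge) -> dict:
    """Split one action's knowledge into the three segment types."""
    return {
        "hierarchical": f"An action in the {k.coarse_group} category, "
                        f"more specifically a {k.fine_group}.",
        "semantic": k.semantic,
        "discriminative": k.discriminative,
    }


def compose_prompt(k: ActionKnowledge) -> str:
    """Join the segments into one text prompt for the action class."""
    return " ".join(build_segments(k).values())


# Illustrative values only; the wording is an assumption, not from the paper.
washing_hands = ActionKnowledge(
    name="washing hands",
    coarse_group="personal hygiene",
    fine_group="hand washing activity",
    semantic="A person rubs both hands together under running water.",
    discriminative="No cloth or dish is held; only the hands are washed.",
)

print(compose_prompt(washing_hands))
```

In a prompt-learning setup, text like this would typically be tokenized and combined with learnable prompt vectors before passing through the VLM's text encoder; that step is omitted here.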

These structured prompts are then used during training to align the visual and textual representations within the VLM. During inference, only the video input is needed, and the model classifies the action by comparing its visual features to the learned text features.
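The inference step described above, comparing a video's visual features to the learned text features, can be sketched with plain cosine similarity. This is a minimal sketch under stated assumptions: the article does not say which similarity measure KRAST uses, so cosine similarity (common in CLIP-style VLMs) is assumed here, and the toy three-dimensional vectors stand in for real encoder outputs. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def classify(video_feature, text_features):
    """Return the best-matching label and the per-class cosine similarities.

    video_feature: list of floats from the (assumed) visual encoder.
    text_features: dict mapping action label -> list of floats from the
                   (assumed) text encoder applied to the learned prompts.
    """
    v = l2_normalize(video_feature)
    scores = {
        label: sum(a * b for a, b in zip(v, l2_normalize(t)))
        for label, t in text_features.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy features for illustration only.
text_features = {
    "washing hands": [1.0, 0.0, 0.0],
    "washing dishes": [0.0, 1.0, 0.0],
}
label, scores = classify([0.9, 0.1, 0.0], text_features)
print(label, scores)
```

Because the text features are fixed once training ends, they can be computed a single time and cached, so only the video has to be encoded at test time.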

Impressive Results on a Robotic Dataset

The KRAST method was rigorously tested on the ETRI-Activity3D dataset, a large-scale benchmark specifically designed for video-based action recognition from a robot’s perspective, focusing on elderly daily activities. The dataset includes 55 action classes performed by both elderly and young adults, captured from multiple angles.

In the experiments, KRAST achieved over 95% accuracy using only standard RGB video inputs at test time. This outperforms previous state-of-the-art approaches, many of which rely on additional data modalities such as skeleton tracking or depth information, which can add computational complexity.

The researchers also found that the SegKPT strategy, combining hierarchical, semantic, and discriminative knowledge, yielded the best results. They also varied the number of video frames sampled per clip and found that 32 frames gave the best balance between capturing temporal detail and computational efficiency.
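To make the frame-sampling step concrete, here is a minimal sketch of picking 32 frames from a clip. The article does not describe how frames are selected, so uniform sampling (one frame from the middle of each equal-length segment) is an assumption, and `sample_frame_indices` is a hypothetical helper.

```python
def sample_frame_indices(total_frames, num_samples=32):
    """Pick num_samples frame indices spread evenly across a clip.

    The clip is split into num_samples equal segments and the frame at the
    center of each segment is chosen. Clips shorter than num_samples simply
    repeat some indices.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    return [
        min(total_frames - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]


# A 320-frame clip yields one index from each 10-frame segment.
indices = sample_frame_indices(320)
print(len(indices), indices[:4])
```

Sampling more frames captures finer motion but increases the cost of every forward pass, which is the trade-off the 32-frame setting balances.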

This work underscores the significant potential of integrating structured textual knowledge into vision-language models for robust video understanding. The ultimate goal is to deploy such systems in real-time for responsive human-robot interaction, enabling robots to adapt to dynamic environments and even learn new actions over time.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes.
