Robots Learn Pick-and-Place Tasks with Visual Cues and Advanced AI

TLDR: A new robotic system uses visual prompts (bounding boxes) and an AI algorithm called Action Chunking with Transformers (ACT) to perform pick-and-place tasks in complex environments like convenience stores. By learning from human demonstrations and predicting sequences of actions, the robot can adapt to diverse objects and cluttered scenes, showing improved accuracy and adaptability.

Robotic systems are becoming increasingly vital in various industries, including retail, where tasks like picking up and placing items in convenience stores present unique challenges. These environments are often cluttered with densely arranged objects, frequent occlusions, and a wide variety of products differing in shape, size, color, and texture. Traditional robotic approaches often struggle with these complexities because they rely on predefined rules or extensive scene understanding, neither of which adapts well to changing conditions.

A new research paper introduces an innovative approach to tackle these challenges, combining ‘annotation-guided visual prompting’ with an advanced imitation learning algorithm called Action Chunking with Transformers (ACT). This system aims to make robotic pick-and-place operations smoother, more adaptive, and data-driven.

Annotation-Guided Visual Prompting: Simplifying Robot Vision

The core idea behind annotation-guided visual prompting is to provide robots with structured spatial guidance using simple bounding box annotations. Instead of requiring the robot to fully understand every detail of a complex scene, these bounding boxes directly highlight the object to be picked and the precise location for placement. This method significantly reduces the computational burden on the robot’s perception system, making it more efficient for dynamic retail settings where products frequently change.
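To make the idea concrete, here is a minimal sketch of how bounding-box prompts might be overlaid on a camera frame before it reaches the policy. The article does not describe the authors' implementation, so everything here is an assumption for illustration: the `Box` type, the `draw_box` and `add_visual_prompts` helpers, the choice of red for the pick target and blue for the placement target, and the use of plain nested lists in place of a real image library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel
Image = List[List[Pixel]]      # rows of pixels; a stand-in for a real image array


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates, inclusive on all sides."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


# Hypothetical colour convention; the article does not say which colours are used.
PICK_COLOR: Pixel = (255, 0, 0)
PLACE_COLOR: Pixel = (0, 0, 255)


def draw_box(image: Image, box: Box, color: Pixel) -> Image:
    """Return a copy of `image` with the outline of `box` drawn in `color`."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy rows so the input frame is not modified

    # Clamp the box to the image so partially visible boxes are still drawn.
    x0, x1 = max(box.x_min, 0), min(box.x_max, width - 1)
    y0, y1 = max(box.y_min, 0), min(box.y_max, height - 1)
    if x0 > x1 or y0 > y1:
        return out  # the box lies entirely outside the frame

    for x in range(x0, x1 + 1):      # top and bottom edges
        out[y0][x] = color
        out[y1][x] = color
    for y in range(y0, y1 + 1):      # left and right edges
        out[y][x0] = color
        out[y][x1] = color
    return out


def add_visual_prompts(image: Image, pick: Box, place: Optional[Box] = None) -> Image:
    """Mark the object to pick and, optionally, the placement location."""
    prompted = draw_box(image, pick, PICK_COLOR)
    if place is not None:
        prompted = draw_box(prompted, place, PLACE_COLOR)
    return prompted


if __name__ == "__main__":
    frame = [[(0, 0, 0)] * 8 for _ in range(6)]  # blank 8x6 frame
    prompted = add_visual_prompts(frame, Box(1, 1, 3, 3), Box(5, 2, 7, 4))
    for row in prompted:
        print("".join("P" if p == PICK_COLOR else "D" if p == PLACE_COLOR else "." for p in row))
```

The point of the sketch is that the prompt lives in the image itself: the policy receives an ordinary frame with the targets already marked, so changing the target only requires drawing a different box.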

Action Chunking with Transformers (ACT): Learning from Human Expertise

Complementing the visual prompting is Action Chunking with Transformers (ACT), an imitation learning algorithm. Unlike traditional methods that break down tasks into many small, individual steps, ACT allows the robotic arm to predict ‘chunked’ action sequences. This means the robot learns to perform coherent segments of a task, such as an entire ‘picking’ motion or a ‘placing’ motion, based on human demonstrations. This approach, inspired by how humans perform tasks, enables the robot to execute actions more fluidly and adaptively, moving away from rigid, step-by-step planning.

The ACT system uses a Transformer-based architecture, which is excellent at understanding sequences of data. It processes human-provided action sequences along with visual inputs, learning the temporal relationships between actions and the spatial relationships between objects. This allows the robot to predict the next sequence of actions based on its current state and the visual prompts.
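The control flow implied by action chunking can be sketched as a loop that queries the policy once per chunk and then executes the predicted actions in order. This is only an illustration under stated assumptions: `run_chunked`, the observation dictionary, and the dummy policy are hypothetical names, the trained Transformer is replaced by a placeholder function, and the chunk size of 5 is arbitrary because the article does not report the value used.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Action = Tuple[float, ...]     # e.g. joint targets plus a gripper command
Observation = Dict[str, int]   # placeholder for camera images and robot state


def run_chunked(
    policy: Callable[[Observation], Sequence[Action]],
    get_observation: Callable[[], Observation],
    send_action: Callable[[Action], None],
    chunk_size: int,
    max_steps: int,
) -> int:
    """Execute up to `max_steps` actions, querying `policy` once per chunk.

    Returns the number of policy queries that were made.
    """
    steps = 0
    queries = 0
    while steps < max_steps:
        chunk = list(policy(get_observation()))[:chunk_size]
        if not chunk:
            raise ValueError("policy returned an empty action chunk")
        queries += 1
        for action in chunk:
            if steps >= max_steps:
                break
            send_action(action)  # actions in a chunk run without re-querying the policy
            steps += 1
    return queries


def make_dummy_policy(chunk_size: int) -> Callable[[Observation], List[Action]]:
    """Stand-in for a trained model: emits `chunk_size` consecutive 1-D targets."""
    def policy(obs: Observation) -> List[Action]:
        start = obs["step"]
        return [(float(start + i),) for i in range(chunk_size)]
    return policy


if __name__ == "__main__":
    executed: List[Action] = []
    n_queries = run_chunked(
        policy=make_dummy_policy(chunk_size=5),
        get_observation=lambda: {"step": len(executed)},
        send_action=executed.append,
        chunk_size=5,
        max_steps=12,
    )
    print(n_queries, len(executed))
```

With a chunk size of 5 and 12 control steps, the policy is queried three times instead of twelve, which is the practical difference between predicting chunks and predicting one action per step.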

The Robotic Setup and Experiments

The researchers used a Universal Robots UR5e arm equipped with a Robotiq 2-Finger Gripper for their experiments. The system also included two Intel RealSense cameras: one mounted on the robot’s hand for close-range views and another on the tabletop for a wider perspective. These cameras feed real-time images to ACT’s neural network.

To evaluate the system, six different products commonly found in Japanese convenience stores—including noodle bowls, chocolate boxes, tea bottles, and small jars—were used. These products represented a diverse range of shapes, sizes, textures, and packaging types. The system was tested across three levels of complexity:

Simple Scenario: Nine similar-shaped boxes arranged in a 3×3 grid, with one object marked for picking.
Complex Scenario: Nine diverse products in a 3×3 grid, with one object marked for picking.
More Complex Scenario: Nine diverse products placed in varying positions, with one marked for picking and another for placement.

In the simple scenario, the system achieved a 90% success rate. In the two harder scenarios, initial success rates were lower (around 70%). After the system was trained on more diverse human demonstration data, success rose to 100% in the complex scenario and to 80-90% in the more complex scenario. This highlights the importance of comprehensive training data for adaptability.

The study also analyzed how ACT focused its attention using heatmaps, showing that the system intelligently shifted its focus between the picking object and the placing destination as needed. While the system performed exceptionally well with rigid objects, reflective or slippery surfaces posed more challenges, indicating areas for future refinement.

Looking Ahead

Despite its successes, the current system is data-demanding, relying heavily on high-quality human demonstrations. Future work will focus on developing data augmentation processes to artificially create human-like data, reducing the need for extensive manual demonstrations and further enhancing the system’s robustness and autonomy.

This research represents a significant step forward in creating more adaptive, data-driven robotic systems capable of performing complex real-world tasks, particularly in challenging retail environments. You can read the full research paper here.
