TLDR: A new framework called “Affordance-Guided Coarse-to-Fine Exploration” helps mobile manipulation robots reach an 85% success rate across five open-vocabulary tasks by intelligently selecting their base placement. It combines visual-language understanding with geometric planning through an iterative process, allowing robots to reason about both *what* to do and *how* to physically interact with objects, even with limited perception. In the reported evaluation, the method outperforms classical geometric planners and other VLM-based approaches.
Mobile robots are becoming increasingly capable, but one persistent challenge in open-vocabulary mobile manipulation (OVMM) is ensuring the robot is positioned correctly to successfully complete a task. It’s not enough for a robot to simply be near an object; it needs to be in the right spot, facing the right way, with enough clearance to interact effectively. This crucial step, known as base placement, often determines whether a task succeeds or fails.
Traditional robot navigation systems often guide robots to the general vicinity of a target object, treating the task as complete once proximity is achieved. However, this approach frequently leads to manipulation failures because it doesn’t consider the specific ‘affordances’ of an object – that is, what actions the object allows. For example, to open a cabinet, a robot must be directly in front of the cabinet door with sufficient space to extend its arm, not just somewhere in the room near the cabinet.
Addressing Key Challenges in Robot Placement
Researchers Tzu-Jung Lin, Jia-Fong Yeh, Hung-Ting Su, Chung-Yi Lin, Yi-Ting Chen, and Winston H. Hsu from National Taiwan University and National Yang Ming Chiao Tung University have introduced a framework called “Affordance-Guided Coarse-to-Fine Exploration” to tackle this problem. Their research paper, “Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation,” proposes a zero-shot approach that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process.
The framework addresses two main challenges: first, robots must reason jointly about geometric feasibility (collision-free paths, appropriate distance) and semantic intent (aligning with task-relevant features like a handle). Second, robots need to reason globally despite their limited, egocentric perceptual input, which often restricts their view to only what’s directly in front of them.
How the New Framework Works
The core of this new method lies in two key innovations:
1. Cross-modal Representations: The system creates unique representations called “Affordance RGB” and “Obstacle Map+”. These combine visual-semantic information from RGB images with spatial geometric data from obstacle maps. This allows the robot to understand both the ‘what’ (semantics) and the ‘where’ (spatial context) of a task, moving beyond the limitations of a single camera view.
2. Coarse-to-Fine Optimization: The robot uses an iterative process that starts with broad semantic guidance from VLMs to explore task-relevant regions. As the process continues, it gradually refines the search using geometric constraints to pinpoint precise, physically feasible placements. This prevents the robot from getting stuck in suboptimal positions that might look semantically correct but are geometrically impossible to execute.
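To make the coarse-to-fine idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the VLM is replaced by a hand-written stand-in (`semantic_score`), the sampler is a simple cross-entropy-style loop, and every name and constant (the 0.7 m reach, the 0.3 m robot radius, the semantic weight sliding from 0.8 to 0.2) is an assumption chosen purely for illustration.

```python
import math
import random


def semantic_score(pose, affordance):
    """Hypothetical stand-in for VLM guidance (no real VLM is called)."""
    x, y, theta = pose
    dist = math.hypot(affordance[0] - x, affordance[1] - y)
    bearing = math.atan2(affordance[1] - y, affordance[0] - x)
    facing = 0.5 * (1.0 + math.cos(bearing - theta))  # 1.0 when facing the point
    return facing * math.exp(-dist / 2.0)  # coarse pull toward the task area


def in_collision(pose, obstacles, robot_radius=0.3):
    """Obstacles are (x, y, radius) circles; the robot base is a circle too."""
    x, y, _ = pose
    return any(math.hypot(x - ox, y - oy) < r + robot_radius for ox, oy, r in obstacles)


def geometric_score(pose, affordance, obstacles, reach=0.7):
    """Zero if colliding; otherwise peaks when the base sits `reach` metres away."""
    if in_collision(pose, obstacles):
        return 0.0
    dist = math.hypot(affordance[0] - pose[0], affordance[1] - pose[1])
    return math.exp(-((dist - reach) ** 2) / 0.1)


def semantic_weight(step, iters, start=0.8, end=0.2):
    """Assumed linear schedule: semantics dominate early, geometry late."""
    return start + (end - start) * step / max(iters - 1, 1)


def coarse_to_fine(affordance, obstacles, start_pose, iters=6, samples=200, elites=20, seed=0):
    rng = random.Random(seed)
    mean = tuple(start_pose)
    best = mean
    for step in range(iters):
        w_sem = semantic_weight(step, iters)
        spread = 0.7 ** step  # sampling spread (m and rad) shrinks each round

        def score(pose):
            if in_collision(pose, obstacles):
                return -1.0  # hard-reject infeasible placements
            geo = geometric_score(pose, affordance, obstacles)
            return w_sem * semantic_score(pose, affordance) + (1.0 - w_sem) * geo

        candidates = [tuple(rng.gauss(m, spread) for m in mean) for _ in range(samples)]
        top = sorted(candidates, key=score, reverse=True)[:elites]
        # Re-centre on the elite set; headings use a circular mean.
        mean = (
            sum(p[0] for p in top) / elites,
            sum(p[1] for p in top) / elites,
            math.atan2(sum(math.sin(p[2]) for p in top), sum(math.cos(p[2]) for p in top)),
        )
        best = top[0]
    return best  # (x, y, heading) of the best candidate from the final round
```

With a handle at `(2.0, 0.0)`, a cabinet body modelled as the circle `(2.6, 0.0, 0.4)`, and a start pose of `(0.0, 0.0, 0.0)`, the loop should drift toward the handle in early rounds and then settle roughly one reach distance away from it. The shrinking spread and shifting weights are what keep the search from locking onto a spot that looks right semantically but cannot be reached.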
In practice, when given an instruction like “Open the cabinet,” the robot first identifies a key ‘affordance point’ (e.g., the cabinet handle). It then uses VLMs to project visual cues onto a 2D obstacle map. This combined information helps the robot sample potential base placements, scoring them based on how well they align with both the task’s meaning and physical reachability. Early in the process, semantic alignment is prioritized, guiding the robot to the general area. Later iterations focus more on geometric precision, ensuring the final placement is exact and executable.
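The step of projecting an affordance point onto a 2D obstacle map can be pictured with a small Python sketch. This is only one plausible reading, not the paper's actual “Obstacle Map+” construction: it assumes the map is an occupancy grid with a known origin and cell size, that the affordance point is already expressed in the map's world frame, and all names here (`world_to_cell`, `annotate_map`, the cell labels) are hypothetical.

```python
import math

# Assumed cell labels for the illustrative grid.
FREE, OCCUPIED, AFFORDANCE = 0, 1, 2


def world_to_cell(x, y, origin=(0.0, 0.0), resolution=0.05):
    """Map world coordinates (metres) to a (row, col) grid index."""
    col = int(math.floor((x - origin[0]) / resolution))
    row = int(math.floor((y - origin[1]) / resolution))
    return row, col


def annotate_map(grid, affordance_xyz, origin=(0.0, 0.0), resolution=0.05):
    """Return a copy of the grid with the affordance point's cell marked.

    The height (z) is dropped: the point is projected straight down
    onto the 2D map.
    """
    row, col = world_to_cell(affordance_xyz[0], affordance_xyz[1], origin, resolution)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError("affordance point falls outside the map")
    annotated = [list(r) for r in grid]  # leave the input grid untouched
    annotated[row][col] = AFFORDANCE
    return annotated


if __name__ == "__main__":
    grid = [[FREE] * 4 for _ in range(4)]
    grid[1][3] = OCCUPIED  # e.g. part of the cabinet body
    handle = (1.2, 0.6, 0.9)  # x, y, z in metres (made-up values)
    for line in annotate_map(grid, handle, resolution=0.5):
        print(line)
```

Once the affordance cell is on the same grid as the obstacles, candidate base placements can be sampled and scored in a single 2D frame, which is the role the combined map plays in the description above.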
Impressive Results and Future Directions
The Affordance-Guided Coarse-to-Fine Exploration framework was evaluated on five diverse open-vocabulary mobile manipulation tasks, including opening cabinets and dishwashers, and placing objects on shelves. The system achieved an 85% success rate, outperforming both classical geometric planners and other VLM-based methods, which often struggle with either semantic understanding or geometric feasibility.
This high success rate demonstrates the potential of combining affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in mobile manipulation. While the method shows great promise, the authors note that future work will focus on improving geometric precision in very tight spaces and incorporating arm trajectory feasibility into the optimization process to prevent collisions during manipulation.
Ultimately, this research brings us closer to a future where mobile robots can perform complex household and industrial tasks with greater reliability and autonomy, understanding not just what to do, but precisely how to do it in the physical world.


