TLDR: This research introduces a new multimodal dataset and framework for generating spatially grounded, context-aware gestures for AI agents. By combining synthetic pointing gestures and real VR-based dialogues, standardized in HumanML3D format, the work enables more natural and situated communication for virtual humans, addressing a key gap in current motion generation models and showing improved performance when fine-tuning existing models.
Creating artificial intelligence agents that can communicate like humans is a complex challenge, especially when it comes to generating gestures that are not only natural but also spatially aware. Current AI models often struggle with this: they focus either on general body movement or on isolated speech-aligned gestures, without accounting for the surrounding environment.
A new research paper, “Grounded Gesture Generation: Language, Motion, and Space,” addresses this critical gap by introducing a novel multimodal dataset and a comprehensive framework. This work aims to enable AI agents to produce gestures that are deeply connected to their environment and conversational context, much like humans do when pointing to objects or referring to locations during a dialogue.
The core of this research lies in combining two significant data resources. First, a synthetic dataset of spatially grounded referential gestures was created, capturing precise 3D target locations for pointing motions. Second, the MM-Conv dataset, a VR-based collection of two-party dialogues, was utilized. This dataset captures natural conversations in virtual reality environments, including synchronized motion, speech, and 3D scene information, where participants interact with shared virtual spaces.
Both datasets have been standardized into the HumanML3D format, a motion representation widely used in human motion modeling. This standardization is crucial for integrating different types of motion data and making them compatible with existing generative models. Together, these resources provide over 7.7 hours of synchronized motion, speech, and scene data for studying grounded communication.
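To make the format more concrete, here is a minimal Python sketch of how one frame of HumanML3D data is laid out. It follows the publicly documented HumanML3D representation (a 22-joint skeleton with 263 values per frame); the article does not say whether the new dataset extends this layout, and the group names and the `split_features` helper are hypothetical labels chosen for illustration.

```python
import numpy as np

# Per-frame layout of the standard HumanML3D representation (22 joints).
# Group names are illustrative labels, not identifiers from the paper.
NUM_JOINTS = 22
HUMANML3D_LAYOUT = [
    ("root_angular_velocity", 1),                        # rotation about the vertical axis
    ("root_linear_velocity_xz", 2),                      # velocity on the ground plane
    ("root_height", 1),
    ("local_joint_positions", (NUM_JOINTS - 1) * 3),     # non-root joints, xyz
    ("local_joint_rotations_6d", (NUM_JOINTS - 1) * 6),  # 6D rotation per non-root joint
    ("local_joint_velocities", NUM_JOINTS * 3),          # all joints, xyz
    ("foot_contacts", 4),                                # binary contact labels
]
FEATURE_DIM = sum(size for _, size in HUMANML3D_LAYOUT)


def split_features(motion):
    """Split a (frames, FEATURE_DIM) array into named feature groups."""
    motion = np.asarray(motion)
    if motion.ndim != 2 or motion.shape[1] != FEATURE_DIM:
        raise ValueError(f"expected shape (frames, {FEATURE_DIM}), got {motion.shape}")
    groups = {}
    start = 0
    for name, size in HUMANML3D_LAYOUT:
        groups[name] = motion[:, start:start + size]
        start += size
    return groups


# Example: a two-second clip of zeros at HumanML3D's 20 frames per second.
clip = np.zeros((40, FEATURE_DIM))
groups = split_features(clip)
```

Because both the synthetic pointing clips and the VR dialogue recordings are converted to one shared layout, a single model can be trained on them without per-dataset preprocessing.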
The framework also connects to a physics-based simulator, which allows for the generation of even more synthetic data and provides a realistic environment for evaluating how well the AI agents perform situated gestures. As a proof-of-concept, the researchers fine-tuned an existing motion generation model called OmniControl on this new combined dataset. OmniControl is known for its ability to control human motion with text prompts and spatial constraints.
The experiments showed promising results. Fine-tuning the model on the new dataset consistently improved the naturalness and accuracy of the generated gestures, especially for pointing motions. This indicates that adapting pre-trained models with task-specific, spatially grounded data is highly beneficial for creating more realistic and context-aware AI behaviors.
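The article does not say how pointing accuracy was measured. One simple way to quantify it, shown here purely as an illustrative sketch and not as the paper's metric, is the angle between the arm direction and the direction from the shoulder to the 3D target. The function name `pointing_error_degrees` and the choice of shoulder and wrist as reference joints are assumptions.

```python
import numpy as np


def pointing_error_degrees(shoulder, wrist, target):
    """Angle in degrees between the shoulder-to-wrist ray and the shoulder-to-target ray.

    Illustrative metric only; the paper may evaluate pointing differently.
    """
    shoulder, wrist, target = (
        np.asarray(p, dtype=float) for p in (shoulder, wrist, target)
    )
    arm = wrist - shoulder
    to_target = target - shoulder
    arm_norm = np.linalg.norm(arm)
    target_norm = np.linalg.norm(to_target)
    if arm_norm == 0 or target_norm == 0:
        raise ValueError("shoulder must differ from both wrist and target")
    cosine = np.dot(arm, to_target) / (arm_norm * target_norm)
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))


# An arm aimed straight at the target has zero error;
# an arm perpendicular to the target direction is off by 90 degrees.
aligned = pointing_error_degrees([0, 0, 0], [1, 0, 0], [2, 0, 0])
perpendicular = pointing_error_degrees([0, 0, 0], [1, 0, 0], [0, 3, 0])
```

A lower angle means the generated gesture is aimed closer to the referenced object, which is the kind of spatial grounding that fine-tuning on target-annotated data is meant to improve.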
This research marks a significant step towards building more embodied and communicative AI agents that can interact naturally within 3D environments. By bridging the gap between gesture modeling and spatial grounding, it lays a foundation for future work on situated gesture generation and multimodal interaction. Full details are in the research paper, “Grounded Gesture Generation: Language, Motion, and Space.”


