TLDR: InterAct is a new large-scale, multi-modal dataset capturing dynamic, expressive, and interactive activities between two people in daily scenarios. It includes synchronized audio, body motions, and facial expressions for over 240 sequences, each lasting a minute or longer, with diverse emotional and relational contexts. The paper also introduces a diffusion-based method to generate realistic two-person animations from speech, addressing limitations of previous single-person or limited-interaction datasets by modeling complex, long-term interactions.
Understanding and recreating the intricate dance of human interaction has long been a challenge in computer graphics and artificial intelligence. Most previous efforts have focused on single individuals or limited conversational gestures between two people, often assuming static body positions. However, real-world interactions are far more dynamic, expressive, and span larger spaces and longer durations.
A new research paper, titled “InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios,” introduces a multi-modal dataset designed to capture the full complexity of two-person interactions. This work, by Leo Ho, Yinghao Huang, Dafei Qin, Mingyi Shi, Wangpok Tse, Wei Liu, Junichi Yamagishi, and Taku Komura, aims to fill a gap in existing datasets by providing a rich resource for modeling objective-driven, dynamic, and semantically consistent interactions.
The InterAct Dataset: A New Window into Human Behavior
The core contribution of this research is the InterAct dataset itself. It comprises 241 motion sequences, each featuring two actors performing a realistic and coherent scenario for a minute or longer. What makes InterAct unique is its comprehensive multi-modal capture: audio, detailed body motions (including intricate hand movements), and facial expressions are recorded simultaneously for both participants. The scenarios are diverse, covering various relationships (e.g., family, friends, co-workers, doctor-patient) and 26 emotion labels, so the data spans a wide range of human behavior.
To create this dataset, a sophisticated capture system was employed. A 28-camera VICON optical MoCap system tracked body motions across a 5m x 5m space, with 53 body markers and 20 finger markers placed on each actor. Facial expressions were captured using iPhones mounted on head rigs, which also held two microphones for audio recording. Crucially, a wireless timecode generator ensured precise temporal synchronization between the motion and face capture systems, achieving frame-level accuracy.
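As a rough illustration of what timecode-based synchronization makes possible, the sketch below maps a face-capture frame to the nearest body-capture frame using each recording's start timecode. The article does not state the frame rates of either system or the timecode format, so the non-drop-frame "HH:MM:SS:FF" format, the 30 fps timecode rate, and the 60 fps and 120 fps stream rates are all assumptions chosen for the example, and the function names are hypothetical.

```python
def timecode_to_seconds(tc: str, tc_fps: int) -> float:
    """Convert a non-drop-frame 'HH:MM:SS:FF' timecode to seconds."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / tc_fps


def align_frame(face_idx: int, face_start_tc: str, face_fps: float,
                body_start_tc: str, body_fps: float, tc_fps: int) -> int:
    """Return the body-frame index closest in time to a given face frame."""
    # Absolute time of the face frame on the shared timecode clock.
    face_time = timecode_to_seconds(face_start_tc, tc_fps) + face_idx / face_fps
    # Offset from the start of the body recording, converted to body frames.
    body_start = timecode_to_seconds(body_start_tc, tc_fps)
    return round((face_time - body_start) * body_fps)


# Assumed rates: face video at 60 fps, body MoCap at 120 fps, timecode at 30 fps.
body_idx = align_frame(0, "00:00:02:00", 60, "00:00:01:00", 120, 30)
```

Here the face recording starts one second after the body recording, so its first frame lines up with body frame 120 under the assumed rates.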
The researchers went to great lengths to ensure the diversity of the dataset, recruiting 7 actors (4 males, 3 females) of varying ages and nationalities. Actors performed impromptu, guided only by character setups and scenario descriptions, often utilizing props like chairs to simulate real-life situations such as a classroom or a doctor’s office. The data is meticulously processed into standard formats: WAV for audio, BVH for body motions, and ARKit blendshape parameters for facial expressions, along with action labels (sit, walk, stand) for each actor at every frame.
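To give a feel for the BVH body-motion format, here is a minimal sketch that reads the joint names, frame count, and frame time from BVH text using only standard BVH keywords (ROOT, JOINT, Frames, Frame Time). The tiny two-joint skeleton is invented for the example; the actual joint names, frame rate, and file layout of the InterAct release are not described in this article.

```python
def parse_bvh_header(text: str):
    """Return (joint_names, frame_count, frame_time) from BVH file contents."""
    joints, n_frames, frame_time = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(("ROOT ", "JOINT ")):
            joints.append(line.split()[1])       # second token is the joint name
        elif line.startswith("Frames:"):
            n_frames = int(line.split(":")[1])
        elif line.startswith("Frame Time:"):
            frame_time = float(line.split(":")[1])
    return joints, n_frames, frame_time


# Hypothetical two-joint skeleton, not taken from the dataset.
SAMPLE_BVH = """HIERARCHY
ROOT Hips
{
  OFFSET 0 0 0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0 10 0
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0 10 0
    }
  }
}
MOTION
Frames: 2
Frame Time: 0.033333
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
"""

joints, n_frames, frame_time = parse_bvh_header(SAMPLE_BVH)
duration_seconds = n_frames * frame_time
```

Multiplying the frame count by the frame time gives the clip duration, which is a quick way to check that a sequence runs for a minute or longer.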
Unveiling Dynamic Interactions
Statistical analysis of InterAct shows that it captures more dynamic and longer-term interactions than previous datasets. Unlike datasets that focus on static conversational gestures, InterAct exhibits much greater diversity in individual motions, in the relative distances between actors, and in their body orientations. This indicates that InterAct captures large-scale, long-term movements rather than only small-scale gestures.
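The relative-distance and orientation statistics mentioned above can be illustrated with a small sketch. It assumes each actor is reduced to a 2D root position on the floor plane plus a heading angle in radians; the paper's exact definitions may differ, and the 45-degree "facing" threshold and all function names are assumptions made for this example.

```python
import math


def relative_distance(pos_a, pos_b) -> float:
    """Euclidean distance between two actors on the floor plane."""
    return math.hypot(pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])


def facing_angle(pos_a, heading_a: float, pos_b) -> float:
    """Angle in degrees between A's heading and the direction from A to B."""
    to_b = math.atan2(pos_b[1] - pos_a[1], pos_b[0] - pos_a[0])
    # Wrap the difference into [-pi, pi) before taking its magnitude.
    diff = (to_b - heading_a + math.pi) % (2 * math.pi) - math.pi
    return abs(math.degrees(diff))


def facing_each_other(pos_a, heading_a, pos_b, heading_b,
                      threshold_deg: float = 45.0) -> bool:
    """True if each actor's heading points toward the other within the threshold."""
    return (facing_angle(pos_a, heading_a, pos_b) <= threshold_deg
            and facing_angle(pos_b, heading_b, pos_a) <= threshold_deg)


# Actor A at the origin looking along +x, actor B two metres away looking back.
distance = relative_distance((0.0, 0.0), (2.0, 0.0))
facing = facing_each_other((0.0, 0.0), 0.0, (2.0, 0.0), math.pi)
```

Computing these two quantities per frame and plotting their distributions is one simple way to compare how much actors move around each other in different datasets.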
Further analysis explored the entropy of body motions and variance of facial animations. Female actors, professional settings, and emotions like surprise, fear, and positive feelings showed higher body entropy, suggesting more varied and dynamic performances. Similarly, facial variance was higher in the lip area when actors faced each other, and emotions like joy and surprise led to greater lip movement, aligning with real-world observations.
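The article does not spell out how body entropy or facial variance is computed, so the following is only one plausible reading: Shannon entropy over a histogram of a one-dimensional motion signal (for example, per-frame joint speed) and population variance of a single blendshape channel. The bin count, the choice of signal, and the function name are assumptions.

```python
import math
from collections import Counter
from statistics import pvariance


def histogram_entropy(values, n_bins: int = 8) -> float:
    """Shannon entropy (in bits) of values bucketed into equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant signal carries no variation
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A still actor versus one whose speed is spread evenly over four levels.
still = histogram_entropy([0.5, 0.5, 0.5, 0.5], n_bins=4)
varied = histogram_entropy([0.0, 1.0, 2.0, 3.0], n_bins=4)

# Variance of a hypothetical lip blendshape channel over two frames.
lip_variance = pvariance([0.0, 1.0])
```

Under this reading, a higher entropy means the motion signal visits more of its range more evenly, which matches the article's interpretation of "more varied and dynamic performances".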
Qualitative observations highlight numerous instances of interesting dynamic interactions: a person shifting away defensively, reactive motions like a high-five, emotional expressions such as jumping in joy, or collaborative tasks involving imaginary objects. This rich variety of scenarios, requiring genuine interaction and collaboration, sets InterAct apart.
A Baseline for Generating Interactive Motions
Beyond the dataset, the paper also presents a simple yet effective diffusion-based method for generating interactive facial expressions and body motions from speech inputs. This baseline model addresses the challenge of jointly estimating synchronized and interaction-aware non-verbal behaviors for two participants.
For facial motion synthesis, the model adapts a diffusion transformer architecture, incorporating additional inputs such as dialogue audio, speaker identities, and whether the two people are facing each other. A novel fine-tuning mechanism is introduced to improve lip accuracy while preserving each actor’s distinctive lip shapes, preventing the mode collapse commonly seen in simpler fine-tuning approaches.
Body motion synthesis is achieved through a hierarchical approach. Instead of directly regressing all joints simultaneously (which is complex due to high dimensionality), the model first estimates lower-body joints from control signals (mel-spectrograms, BERT features from transcripts, and action labels), and then regresses upper-body joints conditioned on both the control signals and the already-generated lower-body movements. This hierarchical mechanism proved more effective for handling the diverse and large-scale body movements present in InterAct.
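The two-stage conditioning described above can be sketched as plain data flow. This is not the paper's diffusion model: the two "models" below are trivial stand-in functions, and the feature sizes and all names are hypothetical. The sketch only shows that the upper-body stage receives both the control signals and the lower-body output.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Model = Callable[[Sequence[Frame]], List[Frame]]


def concat_features(*streams: Sequence[Frame]) -> List[Frame]:
    """Concatenate per-frame feature vectors from equally long streams."""
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all streams must have the same number of frames")
    return [[x for frame in frames for x in frame] for frames in zip(*streams)]


def generate_body(control: Sequence[Frame],
                  lower_model: Model, upper_model: Model) -> List[Frame]:
    """Stage 1: lower body from control. Stage 2: upper body from control + lower."""
    lower = lower_model(control)
    upper = upper_model(concat_features(control, lower))
    return concat_features(lower, upper)


# Stand-in "models" (assumptions): real ones would be learned generators.
def lower_stub(cond: Sequence[Frame]) -> List[Frame]:
    return [[sum(frame)] for frame in cond]


def upper_stub(cond: Sequence[Frame]) -> List[Frame]:
    return [[max(frame), float(len(frame))] for frame in cond]


# Two frames of a 2-D control signal (standing in for audio/text/action features).
control = [[1.0, 2.0], [3.0, 4.0]]
motion = generate_body(control, lower_stub, upper_stub)
```

The length of each conditioning frame seen by the upper-body stub (3 here: two control values plus one lower-body value) confirms that the second stage is conditioned on the first stage's output.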
Experiments and user studies confirm that the proposed method generates realistic and diverse facial and body animations, outperforming existing state-of-the-art methods on various metrics and in user perception. The ability to control generated motions through contextual conditions such as relationships, action labels, and emotional states further enhances its utility.
Ethical Considerations and Future Directions
The researchers emphasize their commitment to ethical practices, having obtained informed consent from all participants and anonymizing all data in the final dataset. They acknowledge the potential ethical concerns surrounding AI-generated content and plan to work with the broader research community to ensure responsible data handling and release.
While InterAct represents a significant step forward, the authors also discuss limitations, including constraints of the capture setup (e.g., head-mounted cameras limiting close facial interactions), the acted nature of the data, and the current baseline model’s reliance on speech and simple contextual cues. Future work will focus on finer-grained control, improved physical plausibility, and better generalization.
The InterAct dataset and its accompanying baseline model offer immense utility for future research in human dyad studies, full-body synthesis, socially-aware motion generation for robotics and VR, and advancing challenging computer vision tasks like dual-subject human mesh recovery. The data and code are made available at https://hku-cg.github.io/interact/ to facilitate further research and reproducibility.


