TLDR: Researchers have introduced CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing: three hours of dances from 18 participants across beginner, intermediate, and professional skill levels, with detailed expert annotations. The dataset frames salsa as an “embodied language” with its own vocabulary, grammar, and conversational dynamics, providing a unique resource for developing AI systems that can understand and generate complex, interactive human movement. Initial benchmarks demonstrate a unified SalsaAgent model’s capabilities in solo and duet dance generation.
In the realm of artificial intelligence, significant strides have been made in understanding and generating human communication through text and voice. However, human interaction extends far beyond spoken or written words, encompassing intricate embodied movements, precise timing, and physical coordination. Modeling these complex, continuous, and bidirectionally reactive interactions between two agents presents a formidable challenge for AI systems.
Introducing CoMPAS3D: Salsa as an Embodied Language
A groundbreaking research paper titled Salsa as a Nonverbal Embodied Language–The CoMPAS3D Dataset and Benchmarks introduces CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing. This dataset is designed as a challenging testbed for developing interactive and expressive humanoid AI. The researchers, Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, and Angelica Lim, propose that salsa duet improvisation can be analyzed as an embodied language, complete with its own vocabulary, grammar, conversational dynamics, fluency levels, stylistic expression, and even dialectical variations.
The CoMPAS3D dataset comprises over three hours of leader-follower salsa dances performed by 18 dancers, spanning beginner, intermediate, and professional skill levels. What sets this dataset apart are its fine-grained expert annotations, covering more than 2,800 move segments. These annotations detail move types, combinations, execution errors, and stylistic elements, providing an unprecedented level of detail for machine learning applications.
Why Salsa?
Salsa, often called the world’s most popular partnered social dance, offers an ideal starting point for embodied interaction benchmarks. Its global reach, improvisational structure, and established evaluation criteria make it comparable to the role English played in the early development of spoken language models. Unlike existing embodied interaction datasets that often focus on isolated or acted actions, CoMPAS3D captures the continuous, adaptive flow of embodied dialogue in a naturalistic setting.
Previous dance motion datasets typically label entire sequences broadly (e.g., “jive” or “samba”) and often feature only professional performers. CoMPAS3D, in contrast, provides frame-level annotations for moves, errors, and stylistic variations across a diverse range of skill levels, addressing a critical gap in embodied data.
Dataset Details and Annotation Process
The CoMPAS3D dataset was collected using a Vicon motion capture system with 20 cameras operating at 120 frames per second. Each dancer wore 53 markers to capture high-fidelity 3D motion data. The 18 participants formed 9 dancing pairs, self-reporting their experience levels. They performed improvised dances to four different salsa music tracks, resulting in 72 sequences, each approximately 2.5 minutes long.
Approximately half of these sequences were manually annotated by a salsa expert with 15 years of experience. The annotation process involved splitting sequences into 8-beat segments, aligning with salsa’s rhythmic structure. Each segment was labeled with one of 30 primary move categories, common execution errors (such as being “off-beat” or “mixed signals”), and the presence of styling (e.g., “lady styling” or “man styling”). Detailed descriptions of moves, including hand holds and secondary combinations, were also provided.
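To make the segmentation step concrete, here is a minimal sketch of how a capture could be cut into 8-beat segments. The function name, the 180 BPM tempo, and the assumption of a fixed tempo per track are illustrative; only the 120 fps capture rate and the 8-beat segment structure come from the paper.

```python
# Hypothetical sketch: split a motion sequence into 8-beat segments for
# annotation, assuming a fixed tempo (BPM) and the 120 fps capture rate.
def segment_bounds(n_frames: int, bpm: float, fps: int = 120,
                   beats_per_segment: int = 8):
    """Yield (start_frame, end_frame) pairs, one per 8-beat segment."""
    frames_per_beat = fps * 60.0 / bpm
    seg_len = int(round(frames_per_beat * beats_per_segment))
    for start in range(0, n_frames, seg_len):
        yield start, min(start + seg_len, n_frames)

# Example: a 2.5-minute take at 120 fps with an assumed 180 BPM tempo.
bounds = list(segment_bounds(n_frames=150 * 120, bpm=180))
```

At 180 BPM and 120 fps each beat spans 40 frames, so every segment covers 320 frames; a real pipeline would additionally align segment boundaries to detected beats rather than assuming a perfectly steady tempo.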
Analysis of these annotations revealed clear distinctions between skill levels. Professionals, for instance, utilize a wider variety of moves and incorporate significantly more styling elements compared to beginners, who tend to stick to basic steps and make more timing mistakes.
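A skill-level comparison like the one described could be computed from the segment annotations along these lines. The record fields (`skill`, `move`, `styling`, `error`) are assumed for illustration and are not the dataset’s actual schema.

```python
from collections import defaultdict

# Hypothetical sketch: per-skill-level summary of move vocabulary size,
# styling rate, and error rate, assuming one annotation record per segment.
def summarize(annotations):
    """Return {skill: (distinct_moves, styling_rate, error_rate)}."""
    moves = defaultdict(set)
    styled = defaultdict(int)
    errors = defaultdict(int)
    total = defaultdict(int)
    for a in annotations:
        skill = a["skill"]
        total[skill] += 1
        moves[skill].add(a["move"])
        styled[skill] += bool(a.get("styling"))
        errors[skill] += bool(a.get("error"))
    return {s: (len(moves[s]), styled[s] / total[s], errors[s] / total[s])
            for s in total}

demo = [
    {"skill": "beginner", "move": "basic", "error": "off-beat"},
    {"skill": "beginner", "move": "basic"},
    {"skill": "professional", "move": "enchufla", "styling": "lady styling"},
    {"skill": "professional", "move": "cross body lead", "styling": "man styling"},
]
stats = summarize(demo)
```

On the toy records above, professionals show a larger move vocabulary and more styling while beginners show more timing errors, mirroring the pattern the annotations revealed.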
Salsa as a Language: A Deeper Dive
The paper elaborates on the analogy between salsa dance and natural language:
- Lexicon and Grammar: Standardized salsa moves like Cross Body Lead and Enchufla act as vocabulary units, combined according to an implicit grammar that dictates natural transitions.
- Fluency Levels: Just like linguistic fluency, dance fluency varies with experience. Beginners show limited vocabulary and more errors, while professionals exhibit advanced moves, individualized styling, and higher speeds.
- Personal Expression and Style: Dancers convey emotion and attitude through movement quality, akin to prosody and accent in speech. This is captured in annotations as “lady styling” and “man styling.”
- Dialects and Variations: Different salsa styles (e.g., LA-style, New York, Cuban) parallel linguistic dialects, each with unique timing and movement structures. CoMPAS3D focuses on the globally popular LA-style.
- Speaker and Listener Roles: The leader initiates moves (speaker), and the follower interprets and responds in real-time (listener), with communication primarily through haptic signals.
- Synchrony and Conversational Dynamics: Improvised salsa involves bidirectional exchanges, where partners adapt to each other’s timing and style, similar to conversational repair in dialogue.
- Evaluability: Salsa performances can be objectively judged using established criteria, providing a clear basis for measuring the quality of generated dance motions.
Benchmark Tasks and SalsaAgent
To foster research in embodied nonverbal communication, the paper proposes two main benchmark tasks:
- Solo Dance Generation: Generating a leader or follower’s motion sequence based on accompanying music and a specified proficiency level.
- Duet Dance Generation: Predicting the follower’s motion given the leader’s motion and the shared musical context.
The researchers also introduce SalsaAgent, a unified multitask model designed to perform both solo and duet dance generation. Built on the MotionLLM backbone, SalsaAgent is pretrained on motion tokens and fine-tuned in a multitask setting. Experiments show that SalsaAgent produces motions considerably closer to ground truth compared to existing baselines, particularly in duet generation, indicating better synchronization between dancers.
Future Directions and Impact
While CoMPAS3D currently focuses on a single dance genre and a limited number of pairs, it lays a robust foundation for future work. Potential applications include developing advanced embodied AI agents capable of social physical interaction, creating adaptive salsa dance training systems (e.g., virtual or augmented reality partners), and building move classifiers. The dataset also opens avenues for studying human-human interaction, including interpersonal synchrony breakdowns.
The long-term goal is to train humanoid robots that can safely and creatively dance with humans, using haptic signaling as a primary form of nonverbal communication. CoMPAS3D represents a significant step towards advancing socially interactive AI, embodied modeling, and nonverbal human-AI collaboration.