TLDR: SIGMACOLLAB is a novel, interactive dataset designed to advance research in human-AI collaboration within physical environments. It features approximately 14 hours of rich, multimodal data from 85 sessions where untrained participants were guided by a mixed-reality AI assistant (SIGMA) to complete various procedural tasks. The dataset includes audio, egocentric camera views, depth maps, and tracking information, providing ecologically valid insights into real-world interaction challenges and supporting the development of more fluid human-AI teamwork.
Researchers have unveiled a new dataset called SIGMACOLLAB, designed to push forward research on human-AI collaboration in physical environments. The resource targets the complex challenges that arise when people and AI systems work together on real-world tasks, moving beyond traditional, static datasets.
The core idea behind SIGMACOLLAB is its application-driven, interactive nature. Rather than passively recording people performing activities, the dataset captures 85 sessions in which untrained participants were actively guided through various procedural tasks by a mixed-reality AI assistant named SIGMA. This approach ensures that the collected data reflects genuine interaction patterns and challenges encountered in practical scenarios, offering greater ecological validity.
Understanding the Need for SIGMACOLLAB
For decades, the research community has strived for fluid human-machine interaction. Building AI systems that can truly collaborate with people in the physical world – whether as virtual assistants, interactive robots, or mixed-reality guides – requires advances across artificial intelligence, computer vision, natural language processing, and human-computer interaction. While significant progress has been made in areas like object detection and action recognition, progress on interaction-related challenges, such as inferring human cognitive states like intentions, goals, and confusion, has been slower.
Many existing egocentric vision datasets, while rich, often capture a single actor performing an activity, making them unsuitable for studying interaction and collaboration. Even interactive datasets that involve human-human instruction don’t fully capture the unique dynamics of human-AI interaction. SIGMACOLLAB fills this gap by focusing on interactions with a standalone AI system, providing a more realistic testbed for developing and evaluating AI models in this space.
The SIGMA System and Data Collection
The data for SIGMACOLLAB was gathered using SIGMA, an open-source mixed-reality task-assistance system that runs on a HoloLens 2 headset. The system guides users step by step through a task, displaying virtual instructions and providing spoken guidance. It leverages multimodal models such as GPT-4o to interpret user utterances and visual information from the egocentric cameras and to generate relevant responses.
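To make that pipeline concrete, here is a minimal, illustrative sketch (not SIGMA's actual implementation) of how an assistant could forward a spoken utterance together with the latest egocentric camera frame to a multimodal model such as GPT-4o. The function name, prompt, and use of the OpenAI Python SDK are assumptions for illustration only.

```python
# Illustrative sketch only -- not SIGMA's actual implementation.
# Sends a user utterance plus an egocentric camera frame to a
# multimodal model (GPT-4o via the OpenAI Python SDK).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_situated_question(utterance: str, frame_jpeg: bytes, current_step: str) -> str:
    """Combine the utterance, the current task step, and the latest frame
    into a single multimodal request and return the model's reply."""
    frame_b64 = base64.b64encode(frame_jpeg).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a mixed-reality task assistant. "
                        f"The user is currently on this step: {current_step}"},
            {"role": "user",
             "content": [
                 {"type": "text", "text": utterance},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
             ]},
        ],
    )
    return response.choices[0].message.content
```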
The dataset includes a rich array of multimodal data streams: participant and system audio; egocentric camera views in color, grayscale, and depth; and head, hand, and eye-gaze tracking. Together, these synchronized streams provide a comprehensive view of each interaction. Post-hoc annotations further enhance the dataset, including manual transcriptions of utterances, word-level timings, and task-success labels.
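As a rough picture of what one session's contents might look like, the following hypothetical Python schema groups the streams and annotations listed above. The class and field names are invented for illustration and do not reflect the dataset's actual release format.

```python
# Hypothetical sketch of one SIGMACOLLAB session's streams and annotations.
# Field names are illustrative only, not the dataset's real schema.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class UtteranceAnnotation:
    speaker: str                                   # "participant" or "system"
    text: str                                      # manual transcription
    word_timings: list[tuple[str, float, float]]   # (word, start_s, end_s)

@dataclass
class Session:
    session_id: str
    task_name: str                # e.g. "nespresso_coffee" (hypothetical label)
    audio: Path                   # participant and system audio
    rgb_video: Path               # egocentric color stream
    grayscale_video: Path         # egocentric grayscale cameras
    depth_video: Path             # depth stream
    head_pose: Path               # head tracking
    hand_tracking: Path           # hand tracking
    eye_gaze: Path                # gaze tracking
    utterances: list[UtteranceAnnotation] = field(default_factory=list)
    task_successful: bool = True  # post-hoc task-success label
```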
Tasks and Participants
Eight diverse procedural tasks were used in the study, ranging from making coffee with a Nespresso machine and replacing a hard drive in a PC to crafting a pin-back button and preparing mocktails. These tasks were chosen for their varied objects, materials, and types of physical actions, presenting a wide range of computer vision challenges.
Twenty-one participants, recruited from the researchers’ organization, took part in the data collection, each attempting up to six tasks in a controlled laboratory setting. The study protocol kept researcher intervention to a minimum, allowing for natural human-AI interaction. The dataset comprises 85 successful task-execution sessions, totaling nearly 14 hours of interaction data.
Key Contributions and Future Outlook
SIGMACOLLAB offers a unique resource for researchers to study real-time collaboration in physically situated settings. Its application-driven nature surfaces novel research challenges, such as detecting self-talk: distinguishing user utterances that are addressed to the system and require a response from those that are merely internal monologue. The open-source nature of the SIGMA application also allows researchers to integrate and test models developed on this data directly within the target application, enabling iterative refinement and evaluation of end-to-end performance.
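As one example of how the self-talk challenge could be approached, here is a hypothetical baseline that asks a language model to decide whether an utterance is addressed to the assistant. This is not a method from the SIGMACOLLAB paper; the prompt, model choice, and function name are assumptions made purely for illustration.

```python
# Hypothetical baseline for self-talk detection: classify whether an
# utterance is addressed to the assistant (needs a response) or is
# self-directed talk. Illustrative sketch only.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "A user is being guided through a physical task by a mixed-reality "
    "assistant. Given the recent dialogue and the user's latest utterance, "
    "answer with exactly one word: ADDRESSED if the utterance is directed "
    "at the assistant and requires a response, or SELF_TALK if it is "
    "self-directed talk that should be ignored."
)

def is_addressed_to_assistant(dialogue_history: list[str], utterance: str) -> bool:
    """Return True if the utterance likely requires a system response."""
    context = "\n".join(dialogue_history[-5:])  # last few turns as context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user",
             "content": f"Recent dialogue:\n{context}\n\nLatest utterance: {utterance}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("ADDRESSED")
```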
The creators of SIGMACOLLAB plan to use this dataset to establish new benchmarks that specifically focus on interaction-related challenges, including timing, proactive interventions, grounding, and detecting user cognitive states like frustration and confusion. The dataset is publicly available on GitHub, encouraging the wider research community to leverage this resource and contribute to the advancement of seamless human-machine collaboration in the physical world. You can find more details about the research paper here: SIGMACOLLAB: An Application-Driven Dataset for Physically Situated Collaboration.


