TLDR: The DR-MoE framework is a new AI model designed to detect subtle and infrequent mistakes in egocentric (first-person) video data. It addresses challenges like imbalanced datasets and diverse user behaviors through a dual-stage approach. The first stage uses two ViViT models (one frozen, one LoRA-tuned) combined by a Feature Mixture-of-Experts module for robust feature extraction. The second stage employs three specialized classifiers, each optimized for different aspects of long-tailed recognition (class imbalance, ranking, generalization), whose predictions are fused by a Classification Mixture-of-Experts module. This method significantly enhances mistake detection, especially for rare errors, using only RGB video input.
Researchers have developed a novel artificial intelligence framework, named Dual-Stage Reweighted Mixture-of-Experts (DR-MoE), designed to accurately identify subtle and infrequent mistakes in egocentric video data. This advancement is particularly significant for applications like assisted living or training, where detecting errors from a first-person perspective is crucial but challenging.
The problem of mistake detection in egocentric videos is complex. Unlike general action recognition, which simply identifies what action is being performed, mistake detection focuses on the quality of the action. This requires a much finer level of detail and sensitivity to minor deviations. A major hurdle is the rarity and ambiguity of mistakes, leading to severely imbalanced datasets where correct actions far outnumber errors. Additionally, the wide variety of tasks and user behaviors in real-world videos makes it difficult for a single model to generalize effectively.
To tackle these issues, the DR-MoE framework employs a two-stage approach, leveraging complementary modeling strategies at both the feature extraction and classification levels.
Stage One: Feature Extraction
In the initial stage, the system extracts spatiotemporal representations from the video using two specialized ‘experts’ based on the ViViT model. One ViViT model is kept ‘frozen’ to retain its general understanding of human actions, learned from extensive pre-training. The second ViViT model is fine-tuned using a technique called Low-Rank Adaptation (LoRA), allowing it to focus specifically on cues that indicate mistakes. These two distinct representations are then combined through a learnable Feature Mixture-of-Experts (F-MoE) module. This module dynamically weighs the contributions of each expert based on the input video, ensuring the most relevant features are prioritized for analysis.
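The article does not reproduce the authors' code, but the gating idea behind the F-MoE can be sketched as follows. This is a minimal PyTorch sketch assuming clip-level embeddings of equal dimension from the two ViViT experts; the class name, feature dimension, and softmax gate design are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class FeatureMoE(nn.Module):
    """Illustrative F-MoE gate: mixes clip-level features from a frozen
    ViViT expert and a LoRA-tuned ViViT expert with input-dependent
    weights. Dimensions and gate design are assumptions, not the
    authors' released code."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Small gating network producing one weight per expert (softmax-normalized).
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, 2),
        )

    def forward(self, frozen_feat: torch.Tensor, lora_feat: torch.Tensor) -> torch.Tensor:
        # frozen_feat, lora_feat: (batch, feat_dim) clip embeddings from the two ViViT experts.
        weights = torch.softmax(
            self.gate(torch.cat([frozen_feat, lora_feat], dim=-1)), dim=-1
        )
        # Weighted sum of the two expert representations.
        return weights[:, 0:1] * frozen_feat + weights[:, 1:2] * lora_feat


# Example with dummy features:
f_moe = FeatureMoE(feat_dim=768)
fused = f_moe(torch.randn(4, 768), torch.randn(4, 768))  # -> (4, 768)
```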
Stage Two: Classification
The combined features from the first stage are then passed to the second stage, which consists of three independently optimized classifiers. Each classifier is designed to address a specific challenge inherent in mistake detection (a rough code sketch of all three objectives follows the list):
- Reweighted Cross-Entropy Loss: This classifier is trained to compensate for the imbalanced data by assigning higher importance to rare mistake classes, improving the detection rate for underrepresented errors.
- AUC Loss: This objective directly optimizes the Area Under the ROC Curve (AUC), a robust metric for evaluating models in imbalanced classification scenarios. It focuses on correctly ranking mistake instances higher than correct ones.
- Label-Aware Loss with Sharpness-Aware Minimization (SAM): This classifier aims to improve the model's generalization and robustness. It adjusts predictions based on class frequencies and uses SAM to find more stable and reliable decision boundaries, making the model less sensitive to minor variations in the data.
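The three objectives follow well-known long-tailed-recognition recipes. The snippet below is a minimal PyTorch sketch of each, assuming a binary correct-vs-mistake setup; the inverse-frequency weighting scheme, the ranking margin, the temperature tau, and the SAM radius rho are illustrative choices, and the SAM step is the standard two-pass approximation rather than the authors' exact training loop.

```python
import torch
import torch.nn.functional as F


# 1) Reweighted cross-entropy: inverse-frequency class weights (assumed scheme).
def reweighted_ce(logits, labels, class_counts):
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return F.cross_entropy(logits, labels, weight=weights.to(logits.device))


# 2) Pairwise AUC surrogate: push mistake scores above correct-action scores.
def auc_surrogate(scores, labels, margin=1.0):
    # scores: (batch,) mistake scores; labels: 1 = mistake, 0 = correct.
    pos, neg = scores[labels == 1], scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0  # no valid (mistake, correct) pairs in this batch
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # all (mistake, correct) score gaps
    return F.relu(margin - diff).pow(2).mean()   # squared-hinge penalty on mis-ranked pairs


# 3) Label-aware (logit-adjusted) cross-entropy: shift logits by log class priors.
def label_aware_ce(logits, labels, class_priors, tau=1.0):
    adjusted = logits + tau * torch.log(class_priors.to(logits.device))
    return F.cross_entropy(adjusted, labels)


# Sharpness-Aware Minimization, schematically: step a distance rho in the
# gradient-ascent direction, take the gradient there, then update from it.
def sam_step(model, compute_loss, optimizer, rho=0.05):
    compute_loss().backward()
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # climb toward the locally worst weights
            perturbations.append((p, e))
    optimizer.zero_grad()
    compute_loss().backward()              # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                      # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
```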
The predictions from these three diverse classifiers are then adaptively integrated by a Classification Mixture-of-Experts (C-MoE) module. Similar to the F-MoE, this module dynamically weighs each classifier’s output, allowing the system to make flexible and robust decisions under various data conditions.
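The same gating pattern recurs at the classification level. Here is a minimal sketch, assuming the three classifiers emit logits over the same label space and that the gate conditions on the fused F-MoE feature; both are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ClassificationMoE(nn.Module):
    """Illustrative C-MoE gate: fuses the three expert classifiers' logits
    with input-dependent weights. Conditioning the gate on the fused
    feature is an assumption, not the released implementation."""

    def __init__(self, feat_dim: int = 768, num_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, fused_feat: torch.Tensor, expert_logits: torch.Tensor) -> torch.Tensor:
        # fused_feat: (batch, feat_dim) output of the F-MoE stage.
        # expert_logits: (batch, num_experts, num_classes) logits from the three classifiers.
        w = torch.softmax(self.gate(fused_feat), dim=-1)      # (batch, num_experts)
        return (w.unsqueeze(-1) * expert_logits).sum(dim=1)   # (batch, num_classes)
```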
The DR-MoE method has demonstrated strong performance, particularly in identifying rare and ambiguous mistake instances. Notably, it achieves competitive results using only RGB video input, matching or even surpassing models that rely on multiple input modalities. The work was presented by the MR-CAS team as its solution to the Mistake Detection Challenge of the HoloAssist 2025 competition. For more technical details, you can refer to the full research paper: Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection.


