TLDR: The DR-MoE framework is a new AI model designed to detect subtle and infrequent mistakes in egocentric (first-person) video data. It addresses challenges like imbalanced datasets and diverse user behaviors through a dual-stage approach. The first stage uses two ViViT models (one frozen, one LoRA-tuned) combined by a Feature Mixture-of-Experts module for robust feature extraction. The second stage employs three specialized classifiers, each optimized for different aspects of long-tailed recognition (class imbalance, ranking, generalization), whose predictions are fused by a Classification Mixture-of-Experts module. This method significantly enhances mistake detection, especially for rare errors, using only RGB video input.
Researchers have developed a novel artificial intelligence framework, named Dual-Stage Reweighted Mixture-of-Experts (DR-MoE), designed to accurately identify subtle and infrequent mistakes in egocentric video data. This advancement is particularly significant for applications like assisted living or training, where detecting errors from a first-person perspective is crucial but challenging.
The problem of mistake detection in egocentric videos is complex. Unlike general action recognition, which simply identifies what action is being performed, mistake detection focuses on the quality of the action. This requires a much finer level of detail and sensitivity to minor deviations. A major hurdle is the rarity and ambiguity of mistakes, leading to severely imbalanced datasets where correct actions far outnumber errors. Additionally, the wide variety of tasks and user behaviors in real-world videos makes it difficult for a single model to generalize effectively.
To tackle these issues, the DR-MoE framework employs a two-stage approach, leveraging complementary modeling strategies at both the feature extraction and classification levels.
Stage One: Feature Extraction
In the initial stage, the system extracts spatiotemporal representations from the video using two specialized ‘experts’ based on the ViViT model. One ViViT model is kept ‘frozen’ to retain its general understanding of human actions, learned from extensive pre-training. The second ViViT model is fine-tuned using a technique called Low-Rank Adaptation (LoRA), allowing it to focus specifically on cues that indicate mistakes. These two distinct representations are then combined through a learnable Feature Mixture-of-Experts (F-MoE) module. This module dynamically weighs the contributions of each expert based on the input video, ensuring the most relevant features are prioritized for analysis.
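The article does not reproduce the authors' code, but the gating idea behind the F-MoE can be sketched as follows. This is a minimal PyTorch sketch assuming clip-level embeddings of equal dimension from the two ViViT experts; the class name, feature dimension, and softmax gate design are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class FeatureMoE(nn.Module):
    """Illustrative F-MoE gate: mixes clip-level features from a frozen
    ViViT expert and a LoRA-tuned ViViT expert with input-dependent
    weights. Dimensions and gate design are assumptions, not the
    authors' released code."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Small gating network producing one weight per expert (softmax-normalized).
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, 2),
        )

    def forward(self, frozen_feat: torch.Tensor, lora_feat: torch.Tensor) -> torch.Tensor:
        # frozen_feat, lora_feat: (batch, feat_dim) clip embeddings from the two ViViT experts.
        weights = torch.softmax(
            self.gate(torch.cat([frozen_feat, lora_feat], dim=-1)), dim=-1
        )
        # Weighted sum of the two expert representations.
        return weights[:, 0:1] * frozen_feat + weights[:, 1:2] * lora_feat


# Example with dummy features:
f_moe = FeatureMoE(feat_dim=768)
fused = f_moe(torch.randn(4, 768), torch.randn(4, 768))  # -> (4, 768)
```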
Stage Two: Classification
The combined features from the first stage are then passed to the second stage, which consists of three independently optimized classifiers. Each classifier is designed to address a specific challenge inherent in mistake detection (a rough code sketch of all three objectives follows the list):
- Reweighted Cross-Entropy Loss: This classifier is trained to compensate for the imbalanced data by assigning higher importance to rare mistake classes, improving the detection rate for underrepresented errors.
- AUC Loss: This objective directly optimizes the Area Under the ROC Curve (AUC), a robust metric for evaluating models in imbalanced classification scenarios. It focuses on correctly ranking mistake instances higher than correct ones.
- Label-Aware Loss with Sharpness-Aware Minimization (SAM): This classifier aims to improve the model's generalization and robustness. It adjusts predictions based on class frequencies and uses SAM to find more stable and reliable decision boundaries, making the model less sensitive to minor variations in the data.
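The three objectives follow well-known long-tailed-recognition recipes. The snippet below is a minimal PyTorch sketch of each, assuming a binary correct-vs-mistake setup; the inverse-frequency weighting scheme, the ranking margin, the temperature tau, and the SAM radius rho are illustrative choices, and the SAM step is the standard two-pass approximation rather than the authors' exact training loop.

```python
import torch
import torch.nn.functional as F


# 1) Reweighted cross-entropy: inverse-frequency class weights (assumed scheme).
def reweighted_ce(logits, labels, class_counts):
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return F.cross_entropy(logits, labels, weight=weights.to(logits.device))


# 2) Pairwise AUC surrogate: push mistake scores above correct-action scores.
def auc_surrogate(scores, labels, margin=1.0):
    # scores: (batch,) mistake scores; labels: 1 = mistake, 0 = correct.
    pos, neg = scores[labels == 1], scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0  # no valid (mistake, correct) pairs in this batch
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # all (mistake, correct) score gaps
    return F.relu(margin - diff).pow(2).mean()   # squared-hinge penalty on mis-ranked pairs


# 3) Label-aware (logit-adjusted) cross-entropy: shift logits by log class priors.
def label_aware_ce(logits, labels, class_priors, tau=1.0):
    adjusted = logits + tau * torch.log(class_priors.to(logits.device))
    return F.cross_entropy(adjusted, labels)


# Sharpness-Aware Minimization, schematically: step a distance rho in the
# gradient-ascent direction, take the gradient there, then update from it.
def sam_step(model, compute_loss, optimizer, rho=0.05):
    compute_loss().backward()
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # climb toward the locally worst weights
            perturbations.append((p, e))
    optimizer.zero_grad()
    compute_loss().backward()              # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                      # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
```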
The predictions from these three diverse classifiers are then adaptively integrated by a Classification Mixture-of-Experts (C-MoE) module. Similar to the F-MoE, this module dynamically weighs each classifier’s output, allowing the system to make flexible and robust decisions under various data conditions.
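The same gating pattern recurs at the classification level. Here is a minimal sketch, assuming the three classifiers emit logits over the same label space and that the gate conditions on the fused F-MoE feature; both are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ClassificationMoE(nn.Module):
    """Illustrative C-MoE gate: fuses the three expert classifiers' logits
    with input-dependent weights. Conditioning the gate on the fused
    feature is an assumption, not the released implementation."""

    def __init__(self, feat_dim: int = 768, num_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, fused_feat: torch.Tensor, expert_logits: torch.Tensor) -> torch.Tensor:
        # fused_feat: (batch, feat_dim) output of the F-MoE stage.
        # expert_logits: (batch, num_experts, num_classes) logits from the three classifiers.
        w = torch.softmax(self.gate(fused_feat), dim=-1)      # (batch, num_experts)
        return (w.unsqueeze(-1) * expert_logits).sum(dim=1)   # (batch, num_classes)
```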
The DR-MoE method has demonstrated strong performance, particularly in identifying rare and ambiguous mistake instances. Notably, it achieves competitive results using only RGB video input, matching or even surpassing models that rely on multiple input modalities. The work was presented by the MR-CAS team as its solution to the Mistake Detection Challenge of the HoloAssist 2025 competition. For more technical details, you can refer to the full research paper: Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection.


