Vision Language Models Advance Human Activity Recognition in Healthcare

TLDR: This research introduces a new method and dataset for evaluating Vision Language Models (VLMs) in dynamic human activity recognition (HAR) for remote health monitoring. It demonstrates that VLMs can achieve performance comparable to, and sometimes better than, traditional deep learning models, offering greater flexibility and efficiency for healthcare applications by interpreting patient activities and supporting natural language interactions.

In the evolving landscape of generative AI, Vision Language Models (VLMs) are showing significant promise, particularly in healthcare. A recent study explores their application in human activity recognition (HAR) for remote health monitoring, an area that has been relatively underexplored. This research highlights the flexibility and capabilities of VLMs to overcome limitations of traditional deep learning models in this critical field.

Remote health monitoring is becoming increasingly vital, especially with an aging global population. The goal is to develop intelligent systems that can continuously monitor patients while upholding their privacy. By encoding visual data and using AI models to interpret patient activities, these systems can allow clinicians to query models with questions like “What is the patient doing?”, making HAR a key component for enhancing healthcare delivery.

Traditional deep learning models for HAR often require extensive labeled datasets and are limited to a fixed set of predefined activity classes. Integrating separate HAR models into broader AI-assisted monitoring systems can also be inefficient. VLMs, however, offer a different approach. Trained on vast multimodal datasets, they can generate detailed and flexible descriptions of patient activities, generalizing across a wide range of actions without being confined to predefined labels. This allows them to recognize and describe activities not explicitly seen during training, leveraging their generative and contextual reasoning abilities.

A significant challenge in applying VLMs to HAR has been the difficulty in evaluating their dynamic and often non-deterministic outputs. To address this, the researchers introduced a descriptive caption dataset and proposed comprehensive evaluation methods. They created a caption-based dataset from the Toyota Smarthome video dataset, specifically tailored for visual-text alignment in healthcare monitoring. This dataset includes descriptive textual captions for each video, generated using a framework that integrates a VLM (GPT-4o) to create captions from visual inputs and ground-truth labels, ensuring alignment through an iterative keyword integration process.
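The iterative keyword integration step can be pictured as a small loop. What follows is only a minimal sketch under stated assumptions, not the researchers' framework: `caption_with_keywords`, the prompt wording, and the `max_rounds` limit are all hypothetical, and the `generate` callable stands in for whatever VLM call (assumed to already have access to the video frames) produces a caption.

```python
from typing import Callable, Sequence


def caption_with_keywords(
    generate: Callable[[str], str],
    label: str,
    keywords: Sequence[str],
    max_rounds: int = 3,
) -> str:
    """Ask a caption generator for a caption, then re-prompt until it
    mentions every keyword tied to the ground-truth label.

    `generate` is a hypothetical stand-in for a VLM call; the prompts and
    the stopping rule are illustrative assumptions, not the study's method.
    """
    prompt = f"Describe the activity in the video. Ground-truth label: {label}."
    caption = generate(prompt)
    for _ in range(max_rounds):
        # Keywords from the label that the caption does not mention yet.
        missing = [k for k in keywords if k.lower() not in caption.lower()]
        if not missing:
            break
        prompt = (
            f"Revise this caption so it mentions {', '.join(missing)}: {caption}"
        )
        caption = generate(prompt)
    return caption


def fake_vlm(prompt: str) -> str:
    """Toy generator used only to exercise the loop without a real model."""
    if prompt.startswith("Revise"):
        return "A person is drinking from a cup in the kitchen."
    return "A person stands in the kitchen."


result = caption_with_keywords(fake_vlm, "drinking from cup", ["drinking", "cup"])
print(result)
```

Passing the generator in as a callable keeps the sketch independent of any particular model API.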

The study employed four evaluation approaches: Keyword Matching, VLM-as-Judge, BERTScore, and Cosine Similarity. After an initial phase to assess reliability, Keyword Matching and Cosine Similarity were identified as the most dependable metrics. BERTScore was found to be misleading because of its broad focus on token similarity. VLM-as-Judge performed below expectations, although that result did suggest the ground-truth dataset was not biased towards GPT-4o’s outputs.
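To make the two retained metrics concrete, here is a rough sketch of how a free-form caption might be mapped to an activity class by each of them. It rests on several assumptions: the keyword lists and reference captions are invented for illustration, and the cosine similarity is computed over simple word counts rather than the learned text embeddings a real evaluation would more likely use. How the study itself turns a similarity score into a predicted class is not described here.

```python
import math
import re
from collections import Counter

# Hypothetical label -> keyword stems; the study's real keyword sets are not given.
KEYWORDS = {
    "drinking": ["drink", "cup", "glass"],
    "reading": ["read", "book"],
    "cooking": ["cook", "stove", "pan"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_match(caption, keywords=KEYWORDS):
    """Return the label with the most keyword hits in the caption, or None."""
    tokens = tokenize(caption)
    scores = {
        label: sum(any(tok.startswith(stem) for stem in stems) for tok in tokens)
        for label, stems in keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_classify(caption, references):
    """Return the label whose reference caption is most similar to the caption."""
    cap = Counter(tokenize(caption))
    return max(
        references,
        key=lambda label: cosine(cap, Counter(tokenize(references[label]))),
    )


# Hypothetical reference captions, one per class.
REFERENCES = {
    "drinking": "a person drinks from a cup",
    "reading": "a person reads a book",
}

print(keyword_match("The person is drinking from a cup at the table"))
print(cosine_classify("someone holding a cup and drinking water", REFERENCES))
```

Prefix matching on keyword stems is a deliberate simplification; it would also match unrelated words that share a stem, which a production keyword matcher would need to guard against.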

Comparative experiments were conducted against state-of-the-art deep learning models. The findings demonstrated that VLMs achieved comparable, and in some cases, superior performance in terms of accuracy. Notably, open-source VLMs like Llama3.2-Vision, DeepSeek-VL2, and InternVL2.5, despite not being explicitly trained on the dataset and using only two keyframes per video, showed competitive results. Llama3.2-Vision, for instance, surpassed several deep learning models in certain evaluations using keyword matching.
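The article does not say how the two keyframes per video were chosen. One simple possibility, shown purely as an assumption, is to pick evenly spaced frames from the interior of the clip; the function name `keyframe_indices` and the spacing rule are hypothetical.

```python
def keyframe_indices(num_frames, k=2):
    """Return k evenly spaced frame indices from the interior of a clip.

    Illustrative assumption only: the study's keyframe selection is not
    described in the article.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    k = min(k, num_frames)  # cannot pick more frames than the clip has
    # Split the clip into k + 1 equal segments and take the boundaries.
    return [((i + 1) * num_frames) // (k + 1) for i in range(k)]


print(keyframe_indices(90))
```

For a 90-frame clip this selects the frames one third and two thirds of the way through.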

When evaluated with the cosine similarity method, which is considered a fairer assessment for VLMs due to its focus on semantic similarity, all VLMs achieved higher Mean Class Accuracy (MCA) scores. InternVL2.5 achieved the highest MCA at 83.8% in the cross-subject evaluation, outperforming all listed deep learning models. DeepSeek-VL2 also showed strong performance, surpassing traditional deep learning models in several settings. Llama3.2-Vision, however, experienced a performance drop with cosine similarity due to its tendency to generate overly verbose descriptions, which negatively impacts semantic similarity scores compared to more concise outputs from DeepSeek-VL2 and InternVL2.5.
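Mean Class Accuracy averages the per-class accuracies, so every activity counts equally no matter how many videos it has. A short sketch with made-up labels shows how it differs from plain accuracy when classes are imbalanced; the function name and the toy data are illustrative only.

```python
from collections import defaultdict


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if not total:
        raise ValueError("y_true must not be empty")
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data: 8 "walk" videos (all correct) and 2 "drink" videos (one correct).
y_true = ["walk"] * 8 + ["drink"] * 2
y_pred = ["walk"] * 8 + ["drink", "walk"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall)                              # plain accuracy: 0.9
print(mean_class_accuracy(y_true, y_pred))  # MCA: 0.75
```

Here the rare class drags MCA down to 0.75 even though 9 of 10 videos are classified correctly, which is why MCA is a common choice for datasets with uneven class sizes.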

This work establishes a strong benchmark for integrating VLMs into intelligent healthcare systems. The descriptive caption dataset developed in this study is a valuable resource for fine-tuning VLMs and for more rigorous evaluation in this domain. The potential for VLMs to consolidate multiple functionalities into a single model could significantly reduce computational demands in assistive systems and remote health monitoring systems.
