Vision Language Models Advance Human Activity Recognition in Healthcare

TLDR: This research introduces a new method and dataset for evaluating Vision Language Models (VLMs) in dynamic human activity recognition (HAR) for remote health monitoring. It demonstrates that VLMs can achieve performance comparable to, and sometimes better than, traditional deep learning models, offering greater flexibility and efficiency for healthcare applications by interpreting patient activities and supporting natural language interactions.

In the evolving landscape of generative AI, Vision Language Models (VLMs) are showing significant promise, particularly in healthcare. A recent study explores their application in human activity recognition (HAR) for remote health monitoring, an area that has been relatively underexplored. This research highlights the flexibility and capabilities of VLMs to overcome limitations of traditional deep learning models in this critical field.

Remote health monitoring is becoming increasingly vital, especially with an aging global population. The goal is to develop intelligent systems that can continuously monitor patients while upholding their privacy. By encoding visual data and using AI models to interpret patient activities, these systems can allow clinicians to query models with questions like “What is the patient doing?”, making HAR a key component for enhancing healthcare delivery.

Traditional deep learning models for HAR often require extensive labeled datasets and are limited to a fixed set of predefined activity classes. Integrating separate HAR models into broader AI-assisted monitoring systems can also be inefficient. VLMs, however, offer a different approach. Trained on vast multimodal datasets, they can generate detailed and flexible descriptions of patient activities, generalizing across a wide range of actions without being confined to predefined labels. This allows them to recognize and describe activities not explicitly seen during training, leveraging their generative and contextual reasoning abilities.

A significant challenge in applying VLMs to HAR has been the difficulty in evaluating their dynamic and often non-deterministic outputs. To address this, the researchers introduced a descriptive caption dataset and proposed comprehensive evaluation methods. They created a caption-based dataset from the Toyota Smarthome video dataset, specifically tailored for visual-text alignment in healthcare monitoring. This dataset includes descriptive textual captions for each video, generated using a framework that integrates a VLM (GPT-4o) to create captions from visual inputs and ground-truth labels, ensuring alignment through an iterative keyword integration process.
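The article does not spell out how the iterative keyword integration works, so the following is only a rough sketch of one way such a loop could look. The function names, the `revise` callable (a stand-in for the VLM call), and the `max_rounds` limit are all assumptions made for illustration, not the researchers' implementation.

```python
def missing_keywords(caption, keywords):
    """Return the label keywords that do not yet appear in the caption."""
    text = caption.lower()
    return [k for k in keywords if k.lower() not in text]


def integrate_keywords(caption, keywords, revise, max_rounds=3):
    """Revise a caption until it mentions every keyword or rounds run out.

    `revise` is a hypothetical callable standing in for the VLM: it takes the
    current caption and the keywords still missing, and returns a new caption.
    """
    for _ in range(max_rounds):
        missing = missing_keywords(caption, keywords)
        if not missing:
            break  # caption already aligned with the ground-truth label
        caption = revise(caption, missing)
    return caption
```

In a real pipeline `revise` would prompt the model with the video frames, the current caption, and the missing keywords; here it can be any function with that shape, which keeps the loop easy to test in isolation.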

The study employed four evaluation approaches: Keyword Matching, VLM-as-Judge, BERTScore, and Cosine Similarity. After an initial phase to assess reliability, Keyword Matching and Cosine Similarity were identified as the most dependable metrics. BERTScore proved misleading because it rewards broad token-level similarity. VLM-as-Judge performed below expectations, although that result did suggest the ground-truth captions were not biased towards GPT-4o’s outputs.
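To make the two retained metrics concrete, here is a minimal sketch of each. It is an illustration under stated assumptions, not the study's code: the function names are invented, keywords are assumed to be single words, and the cosine similarity is computed over simple word-count vectors as a stand-in for whatever text embeddings the researchers used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_match(caption, keywords):
    """Return True if any single-word keyword appears as a token in the caption."""
    tokens = set(tokenize(caption))
    return any(k.lower() in tokens for k in keywords)


def cosine_similarity(a, b):
    """Cosine similarity between word-count vectors of two texts.

    Assumption: word counts replace the embedding model used in the study.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Even in this toy form, a caption padded with extra words scores lower against a short reference than a concise caption does, because the extra words enlarge the vector without adding overlap.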

Comparative experiments were conducted against state-of-the-art deep learning models. The findings demonstrated that VLMs achieved comparable, and in some cases, superior performance in terms of accuracy. Notably, open-source VLMs like Llama3.2-Vision, DeepSeek-VL2, and InternVL2.5, despite not being explicitly trained on the dataset and using only two keyframes per video, showed competitive results. Llama3.2-Vision, for instance, surpassed several deep learning models in certain evaluations using keyword matching.

When evaluated with the cosine similarity method, which is considered a fairer assessment for VLMs because it focuses on semantic similarity, the VLMs generally achieved higher Mean Class Accuracy (MCA) scores. InternVL2.5 achieved the highest MCA at 83.8% in the cross-subject evaluation, outperforming all listed deep learning models. DeepSeek-VL2 also performed strongly, surpassing traditional deep learning models in several settings. Llama3.2-Vision was the exception: its performance dropped under cosine similarity because it tends to generate overly verbose descriptions, which lowers semantic similarity scores relative to the more concise outputs of DeepSeek-VL2 and InternVL2.5.
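Mean Class Accuracy, as the term is commonly used, averages the accuracy of each activity class so that rare activities count as much as frequent ones. The sketch below shows that standard definition; the function name and the example labels are illustrative assumptions, and the study's exact computation may differ in detail.

```python
from collections import defaultdict


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if not total:
        raise ValueError("y_true must not be empty")
    # Per-class accuracy is correct / total for that class; then take the mean.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, with four "walk" clips and one "drink" clip, a model that always predicts "walk" is right on 80% of clips overall, yet its MCA is only 0.5 because it gets every "drink" clip wrong.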

This work establishes a strong benchmark for integrating VLMs into intelligent healthcare systems. The descriptive caption dataset developed in this study is a valuable resource for fine-tuning VLMs and for more rigorous evaluation in this domain. Because a single VLM can consolidate multiple functions, it could significantly reduce computational demands in assistive systems and remote health monitoring systems. More details are available in the full research paper.

Nikhil Patel
https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
