TLDR: AdCare-VLM is a new AI model based on Video-LLaVA that uses patient videos to monitor medication adherence for chronic diseases like tuberculosis. It was fine-tuned on LLM-TB-VQA, a private video question answering dataset built from 806 expert-annotated TB medication videos. The model identifies visual cues such as the patient's face, the medication, water intake, and ingestion to determine adherence patterns. AdCare-VLM outperforms existing vision-language models in accuracy and contextual understanding, offering an automated solution to reduce clinician workload and improve patient outcomes, though it requires more diverse datasets and computational resources for broader implementation.
Medication adherence is a critical factor in managing chronic diseases like diabetes, hypertension, HIV/AIDS, and tuberculosis. Unfortunately, many patients struggle to consistently take their prescribed medications, leading to worsening conditions, increased healthcare costs, and even preventable deaths. Traditional methods of monitoring adherence, such as directly observed therapy (DOT), are often resource-intensive and impractical, especially in remote areas or regions with healthcare worker shortages. While video observed therapy (VOT) offers a more flexible alternative, it still requires extensive manual review of videos by clinicians, which can be time-consuming and prone to human error.
To address these challenges, researchers have developed AdCare-VLM, an innovative artificial intelligence system designed to automate the monitoring of long-term medication adherence using patient videos. This specialized Large Vision Language Model (LVLM) leverages advanced AI to analyze video footage and answer questions related to whether a patient has taken their medication correctly.
How AdCare-VLM Works
AdCare-VLM is built upon a framework called Video-LLaVA, which allows it to understand and process both visual and linguistic information simultaneously. The model is trained to identify key visual cues in patient videos that indicate medication intake. These cues include the clear visibility of the patient’s face, the medication itself, water intake, and the actual act of ingestion. By correlating these visual features with medical concepts, AdCare-VLM can determine adherence patterns.
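To make the cue-to-label mapping concrete, here is a purely illustrative rule-based stand-in. AdCare-VLM learns this mapping end to end from video; the function name, cue arguments, and decision rules below are assumptions for illustration, not the model's actual logic.

```python
def classify_adherence(face_visible: bool,
                       medication_visible: bool,
                       water_intake: bool,
                       ingestion_observed: bool) -> str:
    """Map the four visual cues to an adherence label (illustrative only)."""
    cues = [face_visible, medication_visible, water_intake, ingestion_observed]
    if all(cues):
        return "positive"   # all cues present: medication clearly taken
    if not any(cues):
        return "negative"   # no cues present: no medication taken
    return "ambiguous"      # partial evidence: unclear adherence


print(classify_adherence(True, True, True, True))   # positive
print(classify_adherence(True, True, False, False)) # ambiguous
```

In the real model, these cues are not hand-coded booleans but visual features extracted from frames and correlated with medical concepts by the language model.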
A crucial aspect of AdCare-VLM’s development involved fine-tuning it on a unique and private dataset. This dataset comprises 806 custom-annotated videos specifically for tuberculosis (TB) medication monitoring. Clinical experts meticulously labeled these videos, categorizing them into positive (medication taken), negative (no medication taken), and ambiguous (unclear adherence) cases. This detailed annotation process created LLM-TB-VQA, a comprehensive medical adherence video question answering dataset.
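As a hypothetical sketch of what one LLM-TB-VQA record might look like, the structure below pairs a video with a question-answer pair and one of the three annotation categories described above. The field names and example strings are assumptions; only the label set (positive, negative, ambiguous) comes from the annotation scheme.

```python
from dataclasses import dataclass

# The three categories clinical experts used when labeling the videos.
LABELS = {"positive", "negative", "ambiguous"}


@dataclass
class AdherenceVQASample:
    """One hypothetical video question answering record."""
    video_path: str
    question: str
    answer: str
    label: str

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")


sample = AdherenceVQASample(
    video_path="videos/patient_0001.mp4",  # placeholder path, not real data
    question="Did the patient take the prescribed TB medication?",
    answer="Yes, the patient placed the pill in their mouth and drank water.",
    label="positive",
)
```

Validating the label at construction time mirrors how a curated annotation pipeline would reject entries outside the agreed category set.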
Key Features and Performance
The AdCare-VLM model integrates images, videos, and text with a robust large language model foundation. It uses a technique called “pre-alignment to projection” to map videos and images into a shared feature space, allowing the AI to learn from a unified visual representation. This means the model can effectively understand the same information, whether it’s presented as text, an image, or a video.
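A minimal NumPy sketch of the shared-space idea: image and video features of different sizes are mapped by separate linear projections into one common dimension before being handed to the language model. All dimensions and weights here are assumptions for illustration, not the model's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, VID_DIM, SHARED_DIM = 1024, 2048, 4096  # assumed feature sizes

# Separate learned projections, one per modality (randomly initialized here).
W_img = rng.standard_normal((IMG_DIM, SHARED_DIM)) * 0.02
W_vid = rng.standard_normal((VID_DIM, SHARED_DIM)) * 0.02


def project(features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Linearly project modality-specific features into the shared space."""
    return features @ weight


image_feats = rng.standard_normal((1, IMG_DIM))   # one image token
video_feats = rng.standard_normal((8, VID_DIM))   # 8 uniformly sampled frames

shared_img = project(image_feats, W_img)
shared_vid = project(video_feats, W_vid)

# Both modalities now live in the same space and can be consumed uniformly.
assert shared_img.shape[1] == shared_vid.shape[1] == SHARED_DIM
```

Because both outputs share one dimensionality, the downstream language model sees a single unified visual representation regardless of whether the input was an image or a video.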
Experimental results show that AdCare-VLM outperforms other parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi. It demonstrated significant improvements in accuracy across various configurations, including pre-trained, regular, and low-rank adaptation (LoRA) setups. The model particularly excels in contextual and temporal understanding, providing a more nuanced interpretation of patient actions and their environment.
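The LoRA setup mentioned above can be sketched in a few lines: instead of updating a full weight matrix W, training touches only two small matrices A and B whose product is added to the frozen weight. The sizes, rank, and scaling below are illustrative choices, not the paper's configuration.

```python
import numpy as np

d, r = 512, 8    # hidden size and LoRA rank (assumed values)
alpha = 16       # LoRA scaling hyperparameter (assumed)

rng = np.random.default_rng(1)
W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, rank r
B = np.zeros((d, r))                     # trainable, zero-initialized


def adapted_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass with the low-rank delta added to the frozen weight."""
    return x @ (W + (alpha / r) * (B @ A)).T


full_params = W.size            # 512 * 512 = 262144
lora_params = A.size + B.size   # 2 * (8 * 512) = 8192
print(f"trainable params: {lora_params} vs {full_params}")
```

With B initialized to zero, the adapted model starts out identical to the pre-trained one, and only about 3% of the layer's parameters are trained, which is what makes such PEFT configurations cheap to fine-tune.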
For instance, the model can identify if a patient is holding a pill, drinking water, and swallowing, and then determine if these actions constitute positive adherence. This level of detail helps in automating repetitive monitoring tasks, reducing the workload for healthcare professionals, and potentially improving the quality of care.
Looking Ahead
While AdCare-VLM shows promising results, the researchers acknowledge certain limitations and future directions. More large-scale, open-access annotated datasets, especially from diverse contexts like Africa, are crucial for further advancement. Addressing data distribution disparities and potential biases (gender, socio-economic, cultural) is also essential for equitable and effective implementation. The current model likewise has only moderate capability in understanding very long videos, since its reliance on uniformly sampled frames can miss fine-grained details.
Despite these challenges, AdCare-VLM represents a significant step forward in digital health. By leveraging generative AI and vision-language models, it offers a powerful tool for predicting and monitoring medication adherence, ultimately contributing to better health outcomes for patients with chronic diseases. The source code and pre-trained weights for this research will be made accessible for further development and exploration, and more details are available in the full paper.


