TLDR: ASK-HINT is a new method that uses detailed, action-specific questions to guide vision-language models in detecting unusual events in videos. Compared with abstract prompts, this fine-grained approach improves accuracy, yields a clear explanation for each decision, and requires no additional training or fine-tuning, which makes it adaptable to surveillance and other applications.
Detecting unusual or abnormal events in video streams, known as Video Anomaly Detection (VAD), is a critical task with wide-ranging applications, from autonomous driving to surveillance monitoring. While simply identifying an anomaly is important, practical systems also need to explain why an event is considered abnormal. Recent advancements in Vision-Language Models (VLMs), which combine powerful visual understanding with language reasoning, show great promise for VAD, offering natural language explanations.
However, current methods often use very general or abstract prompts to guide these VLMs. Such prompts frequently miss the subtle, fine-grained details of human-object interactions or specific actions that define complex anomalies in real-world surveillance footage. For instance, a VLM given an abstract prompt might miss a robbery because the prompt never directs it to look for actions like ‘property being taken’ or ‘physical confrontation’.
Introducing ASK-HINT: A Smarter Way to Prompt VLMs
To address this limitation, researchers have introduced a novel framework called ASK-HINT. This structured prompting approach leverages action-centric knowledge to extract more accurate and understandable reasoning from existing, pre-trained VLMs. The core idea is to move beyond vague questions and instead use detailed, specific inquiries that align the model’s predictions with clear visual cues.
ASK-HINT organizes its prompts into semantically meaningful groups, such as ‘violence’, ‘property crimes’, or ‘public safety incidents’. Within these groups, it formulates fine-grained guiding questions. For example, instead of asking ‘Is there an anomaly?’, it might ask ‘Do you see punching, kicking, or wrestling on the ground?’ or ‘Is there any fire or smoke?’. This level of detail helps the VLM focus on specific visual evidence, leading to better detection and more interpretable explanations.
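To make the idea of grouped, fine-grained questions concrete, here is a minimal sketch of how such a prompt could be assembled. The group names and the two quoted questions come from the text above; the property question, the assignment of questions to groups, the instruction sentence, and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical grouping: only the group names and the punching/fire questions
# appear in the article; everything else here is an assumption.
GUIDING_QUESTIONS = {
    "violence": [
        "Do you see punching, kicking, or wrestling on the ground?",
    ],
    "property crimes": [
        "Is property being taken from a person or place?",  # assumed wording
    ],
    "public safety incidents": [
        "Is there any fire or smoke?",  # assumed group assignment
    ],
}


def build_prompt(groups: dict[str, list[str]]) -> str:
    """Render grouped questions as one plain-text prompt for a VLM."""
    lines = ["Answer each question with yes or no, based only on the video."]
    number = 1
    for group, questions in groups.items():
        lines.append(f"[{group}]")
        for question in questions:
            lines.append(f"Q{number}. {question}")
            number += 1
    return "\n".join(lines)


prompt = build_prompt(GUIDING_QUESTIONS)
print(prompt)
```

Numbering the questions across groups gives the model a stable handle (Q1, Q2, ...) that it can cite when explaining which visual cue triggered a decision.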
How ASK-HINT Works
The framework operates in three main steps:
1. Class-Wise Prompt Construction: Initially, a pool of fine-grained, action-focused questions is created for each type of anomaly. These questions are designed to target concrete visual actions or human-object interactions relevant to that specific anomaly class.
2. Semantic Compression via Prompt Selection: A key insight of ASK-HINT is that many anomaly types share common underlying actions. For example, ‘setting fire’ is relevant to both ‘Arson’ and ‘Explosion’. The framework uses the VLM itself to identify and cluster semantically related prompts. These clusters are then summarized into a compact set of 2-3 generalized guiding questions for each group (e.g., ‘Violence or Harm to People’, ‘Crimes Against Property’, ‘Public Safety Incidents’). This compression makes the process more efficient and reduces the risk of the model getting confused by too many irrelevant prompts.
3. Structured Inference with Explanation Trace: During detection, the VLM uses this compact set of guiding questions. It first makes a binary decision: is the video ‘Normal’ or ‘Abnormal’? If abnormal, it then assigns the video to one of the predefined semantic groups and provides a concise reason based on the specific questions it answered ‘yes’ to. This structured output provides a clear, human-auditable explanation for the anomaly detection.
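The three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the question pool is made up, the class-to-group mapping is supplied by hand (ASK-HINT asks the VLM itself to cluster and summarize prompts), compression is reduced to de-duplication plus a cap of three questions per group, and the `Label: / Group: / Reason:` reply format is an assumed convention for the structured output.

```python
from dataclasses import dataclass
from typing import Optional

# Step 1: class-wise pool (illustrative questions, not the paper's actual pool).
CLASS_PROMPTS = {
    "Arson": ["Is someone setting fire to an object?", "Is there any fire or smoke?"],
    "Explosion": ["Is there any fire or smoke?", "Is there a sudden blast?"],
    "Robbery": ["Is property being taken by force?"],
}

# Step 2 stand-in: a hand-written mapping replaces the VLM-driven clustering.
CLASS_TO_GROUP = {
    "Arson": "Public Safety Incidents",
    "Explosion": "Public Safety Incidents",
    "Robbery": "Crimes Against Property",
}


def compress(class_prompts, class_to_group, max_per_group=3):
    """Merge class-wise questions per group, drop duplicates, cap the count."""
    groups = {}
    for cls, questions in class_prompts.items():
        bucket = groups.setdefault(class_to_group[cls], [])
        for question in questions:
            if question not in bucket and len(bucket) < max_per_group:
                bucket.append(question)
    return groups


@dataclass
class Verdict:
    label: str               # "Normal" or "Abnormal"
    group: Optional[str]     # semantic group, only set for abnormal videos
    reason: Optional[str]    # short explanation trace


def parse_response(text: str) -> Verdict:
    """Step 3: parse an assumed 'Label: / Group: / Reason:' reply format."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    if fields.get("label") != "Abnormal":
        return Verdict("Normal", None, None)
    return Verdict("Abnormal", fields.get("group"), fields.get("reason"))


groups = compress(CLASS_PROMPTS, CLASS_TO_GROUP)
reply = "Label: Abnormal\nGroup: Public Safety Incidents\nReason: Fire and smoke are visible."
verdict = parse_response(reply)
print(groups)
print(verdict)
```

Note how the shared question about fire or smoke is kept only once after compression, which mirrors the observation that several anomaly classes rely on the same underlying actions. In a real pipeline, `reply` would come from the frozen VLM rather than a hard-coded string.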
Key Advantages and Performance
ASK-HINT offers several significant benefits:
- Improved Accuracy: Extensive experiments on datasets like UCF-Crime and XD-Violence show that ASK-HINT consistently outperforms previous methods, including both training-free and some fine-tuned approaches, achieving state-of-the-art performance.
- Enhanced Interpretability: By providing detailed reasoning traces aligned with fine-grained actions, the framework makes VLM decisions transparent and understandable.
- Training-Free and Generalizable: ASK-HINT works with frozen (pre-trained) VLMs without requiring any additional training or fine-tuning. This makes it highly adaptable and capable of generalizing across different datasets and even to previously unseen anomaly categories.
The research demonstrates that the granularity of prompts plays a critical role in unlocking the full reasoning capabilities of VLMs for video anomaly detection. By focusing on specific actions rather than abstract labels, ASK-HINT establishes a new, efficient, and explainable solution for this challenging task.
For more technical details, you can refer to the full research paper: Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting.
While ASK-HINT represents a significant step forward, the authors acknowledge limitations, such as its reliance on a static prompt set and its current lack of explicit temporal modeling. Future work will explore dynamic, context-aware prompting and incorporating more sophisticated temporal reasoning to further enhance its capabilities in complex, evolving environments.


