TLDR: A new research paper introduces a ‘Controllable Hybrid Captioner’ (CHC) that enhances long-form video understanding. This system efficiently generates both action and scene descriptions by fine-tuning a single model (LaViLa-CHC) to alternate between caption types based on detected scene changes. This approach creates a richer, text-based memory of video content, leading to improved accuracy in answering complex questions about long videos compared to traditional action-only captioning or multi-model systems.
Understanding long-form video content, which is dense and high-dimensional, is a significant challenge in artificial intelligence. Traditional methods struggle to scale to longer durations because the computational cost of analyzing every moment becomes prohibitive. This limitation makes it difficult for AI systems to follow complex narratives and the relationships among actions, actors, and objects that unfold over extended periods.
To address this, researchers have explored creating text-based summaries of video content. These summaries, often called “caption logs,” offer a much more compact way to represent relevant information from videos. Crucially, these textual representations are easily processed by large language models (LLMs), which can then perform sophisticated reasoning to answer complex questions about the video content.
The Challenge of Long-Form Video
Current approaches for video understanding typically focus on short clips, often just a few seconds long, primarily for tasks like human action recognition. While effective for these specific tasks, these algorithms are not designed to handle the intricacies of long-form videos, which require understanding a sequence of events and their broader context. The sheer volume of data in long videos also makes detailed spatio-temporal modeling computationally impractical.
A Novel Approach: Building Text-Based Memory
A recent research paper introduces an innovative framework that tackles long-form video understanding by progressively building a compact, text-based memory of observed activity. This memory is constructed by a video captioner that operates on shorter, manageable chunks of the video, where detailed analysis is more feasible. The core idea is to create a log of events that occur during the video, which can then be used to answer natural language questions.
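The idea of a chunk-by-chunk caption log can be illustrated with a minimal sketch. This is not the paper's implementation: the names `LogEntry`, `chunk_spans`, `build_caption_log`, and `format_qa_prompt`, the prompt wording, and the chunk length are all assumptions made for illustration, and the captioner is a placeholder callable standing in for a real video captioning model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LogEntry:
    start_s: float  # chunk start time in seconds
    end_s: float    # chunk end time in seconds
    caption: str    # text produced for this chunk


def chunk_spans(duration_s: float, chunk_s: float) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive chunks of at most chunk_s seconds."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    duration_s = float(duration_s)
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


def build_caption_log(
    duration_s: float,
    chunk_s: float,
    captioner: Callable[[float, float], str],
) -> List[LogEntry]:
    """Caption each short chunk and collect the results as a text-based memory."""
    return [
        LogEntry(start, end, captioner(start, end))
        for start, end in chunk_spans(duration_s, chunk_s)
    ]


def format_qa_prompt(log: List[LogEntry], question: str) -> str:
    """Render the log as plain text that a language model could be asked about."""
    lines = [f"[{e.start_s:.0f}s-{e.end_s:.0f}s] {e.caption}" for e in log]
    return "Caption log:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Placeholder captioner; a real system would run a video model on the chunk.
    fake_captioner = lambda s, e: f"clip {s:.0f}-{e:.0f}"
    log = build_caption_log(10, 4, fake_captioner)
    print(format_qa_prompt(log, "What did the person do?"))
```

The point of the design is that the language model only ever sees the compact text log, never the raw frames.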
The system builds upon the LaViLa (Language-augmented Video Language Pretraining) captioner, which learns video-language representations. While previous models like LaViLa and LLoVi primarily focused on generating “action captions” (describing human activities), the new research recognized that questions about videos might also pertain to static elements or the overall scene. Therefore, a key improvement involves enriching these activity logs with “static scene descriptions” using Vision Language Models (VLMs).
Two Paths to Hybrid Captioning
The researchers explored two main approaches to integrate scene information:
- Ensemble Video Captioner: This initial method pairs the LaViLa captioner with a separate VLM, specifically LLaVA (available in 7B and 34B parameter sizes). When a scene change is detected in the video, the LLaVA VLM is prompted to describe the scene, focusing on objects and their properties. This creates a more comprehensive caption log by combining action and scene details.
- Controllable Hybrid Video Captioner (LaViLa-CHC): To enhance efficiency, the researchers fine-tuned the LaViLa narrator itself to generate both action and scene captions from a single model. This “hybrid” model learns to switch between generating action and scene descriptions based on special input tokens ([ACX] for action and [SCX] for scene). This significantly streamlines the captioning pipeline compared to using two separate models. Despite its smaller size (LaViLa’s LLM, GPT-2-medium, has 137 million parameters compared to LLaVA’s billions), the LaViLa-CHC proved effective, with techniques like repetition penalty applied to ensure quality in longer scene descriptions.
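As a rough sketch of how a single captioner might be steered by control tokens, the snippet below requests a scene caption for any chunk that contains a detected scene change and an action caption for every chunk. The tokens [ACX] and [SCX] come from the description above; everything else is an assumption for illustration, including the function name `hybrid_caption_log`, the captioner signature, and the choice to emit the scene caption before the action caption.

```python
from typing import Callable, List, Sequence, Tuple

ACTION_TOKEN = "[ACX]"  # control token requesting an action caption
SCENE_TOKEN = "[SCX]"   # control token requesting a scene caption

# Hypothetical captioner interface: (control_token, start_s, end_s) -> caption text.
Captioner = Callable[[str, float, float], str]


def hybrid_caption_log(
    chunks: Sequence[Tuple[float, float]],
    scene_changes: Sequence[float],
    captioner: Captioner,
) -> List[Tuple[str, str]]:
    """Build a log of (kind, caption) pairs from one controllable captioner.

    A scene caption is added whenever a scene-change timestamp falls inside
    the chunk; an action caption is added for every chunk.
    """
    log: List[Tuple[str, str]] = []
    for start, end in chunks:
        if any(start <= t < end for t in scene_changes):
            log.append(("scene", captioner(SCENE_TOKEN, start, end)))
        log.append(("action", captioner(ACTION_TOKEN, start, end)))
    return log


if __name__ == "__main__":
    # Placeholder model that just echoes the token and the time span.
    fake = lambda token, s, e: f"{token} caption for {s}-{e}s"
    for kind, text in hybrid_caption_log([(0, 4), (4, 8), (8, 12)], [0, 8], fake):
        print(kind, "|", text)
```

Because only the control token changes between the two requests, a single set of model weights serves both caption types, which is the efficiency argument made for the hybrid model.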
To determine precisely when to add scene information, various temporal segmentation methods were investigated, including uniform sampling, PySceneDetect, and Kernel Temporal Segmentation (KTS). These methods help identify significant scene changes within the video, triggering the generation of a scene description.
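Of the segmentation methods mentioned, uniform sampling is the simplest to sketch: scene descriptions are triggered at fixed time intervals regardless of content. The function name `uniform_boundaries` and the 60-second interval in the example are illustrative assumptions, not values from the paper. Content-aware methods such as PySceneDetect or KTS would replace this function with boundaries derived from the video itself.

```python
import math
from typing import List


def uniform_boundaries(duration_s: float, interval_s: float) -> List[float]:
    """Return evenly spaced trigger times 0, interval_s, 2*interval_s, ... below duration_s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if duration_s <= 0:
        return []
    count = math.ceil(duration_s / interval_s)
    # Multiply rather than accumulate to avoid floating-point drift,
    # and drop any value that rounding pushed up to the video's end.
    times = [i * interval_s for i in range(count)]
    return [t for t in times if t < duration_s]


if __name__ == "__main__":
    # A 3-minute video with a scene description requested once per minute.
    print(uniform_boundaries(180, 60))
```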
Performance and Impact
The effectiveness of this new framework was evaluated on the EgoSchema dataset, a benchmark for long-form video language understanding comprising over 5,000 multiple-choice question-answer pairs spanning over 250 hours of video. The results showed that adding scene information to the video caption log consistently improved the system's question-answering accuracy. For instance, the LaViLa-CHC model, especially when using uniform sampling for scene detection, showed strong performance, often outperforming the ensemble system while requiring significantly less memory.
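For a multiple-choice benchmark, accuracy is simply the fraction of questions for which the selected option matches the correct one. The sketch below shows that calculation; the function name and the example option indices are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index equals the correct one."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Hypothetical option indices for four questions; three of four match.
    print(multiple_choice_accuracy([0, 2, 4, 1], [0, 2, 3, 1]))
```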
Beyond question-answering accuracy, the researchers also assessed the lexical and semantic similarity of the generated captions. The action captions produced by LaViLa-CHC showed higher semantic similarity to ground-truth captions, suggesting that the additional training for scene captions also benefited action description quality. While scene captions had lower lexical similarity (likely due to their longer, more varied nature), their semantic similarity remained strong.
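The article does not name the similarity metrics used, so the following is only a stand-in to make the lexical side of the comparison concrete: a unigram-overlap F1 score between a generated caption and a reference. The function name `unigram_f1` and the example captions are assumptions. A score like this rewards exact word matches, which helps explain why longer, more varied scene captions can score low lexically while still being close in meaning.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercase whitespace tokens shared by candidate and reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Counter intersection counts each shared word at most as often as it
    # appears in both captions.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(unigram_f1("the person opens a door", "the person closes a door"))
```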
Looking Ahead
This work presents a robust framework for long-form video understanding that combines action and scene descriptions. By detecting scene changes and using a single, controllable hybrid captioner, the system achieves improved performance and efficiency. The approach also opens the door to future work, such as using smaller LLMs for question answering by supplying them with a richer, more detailed caption log.


