TLDR: This paper introduces a framework that efficiently processes video data by combining various pre-trained AI models for multimodal content analysis. It transforms videos into a temporal, semi-structured format called VideoKnowledgeBase, which is then converted into a queryable, frame-level indexed knowledge graph. The framework supports continual learning, allowing users to dynamically add new domain-specific knowledge and retrieve video segments using multimodal queries.
Analyzing complex multimedia content, especially video, remains a significant challenge. Traditional methods often struggle to integrate diverse information from visual, auditory, and textual channels into a cohesive, structured, and easily searchable format. This paper introduces a framework designed to streamline multimodal content analysis and understanding by transforming raw video data into an indexed knowledge graph.
The research, titled “From Videos to Indexed Knowledge Graphs – Framework to Marry Methods for Multimodal Content Analysis and Understanding,” was conducted by Basem Rizk, Joel Walsh, Mark Core, and Benjamin Nye from the University of Southern California. Their work addresses the need for a more comprehensive approach that can effectively combine various modalities and represent them in a structured, temporally aware manner, similar to sophisticated expert systems but with the added capability of continual learning.
A Three-Phase Approach to Video Understanding
The methodology presented in the paper unfolds in three distinct phases:
- Framework Construction: The first phase builds a flexible framework for the optimized composition of various pre-trained models. The framework is designed to efficiently process temporal multimodal data, such as videos, by enabling seamless integration and interaction between different analytical methods and models.
- Video to Semi-structured Data: Building on this framework, the second phase defines a pipeline that transforms videos into a semi-structured format the researchers call a ‘VideoKnowledgeBase’. A series of qualitatively selected pre-trained models and existing methods extract meaningful information from the video content.
- Semi-structured Data to Knowledge Graphs: The final phase focuses on converting the generated VideoKnowledgeBases into ‘Video Knowledge Graphs’. These graphs are not only queryable but also extensible, meaning new domain-specific knowledge can be incorporated dynamically through interactive mini-classifiers.
How the Framework Operates
At the core of the framework is the concept of a ‘DataWindow’, a logical unit that encapsulates a segment of multimedia, such as a sequence of video frames aligned with a segment of transcription. These DataWindows flow through ‘Pipes’, which are processing components that wrap machine learning models for inference. A ‘PipeDirector’ guides the application of these pipes, ensuring data is correctly preprocessed and formatted. The entire process is orchestrated by a ‘Pipeline’ component, which can run pipes sequentially, in parallel, or in a loop, maximizing resource utility for near real-time performance.
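To make the architecture concrete, here is a minimal Python sketch of these abstractions. The class names follow the paper, but every implementation detail below is an assumption for illustration; the actual framework adds a PipeDirector, parallel and looped execution, and resource scheduling.

```python
# Minimal sketch of the framework's core abstractions. Class names follow
# the paper; all internals here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DataWindow:
    """A segment of multimedia: video frames aligned with a transcript span."""
    frames: list                     # e.g. decoded frames as numpy arrays
    transcript: str                  # the aligned transcription segment
    start_s: float                   # segment start time in seconds
    end_s: float                     # segment end time in seconds
    annotations: dict = field(default_factory=dict)  # results written by pipes

class Pipe:
    """Wraps a pre-trained model for inference over a DataWindow."""
    def process(self, window: DataWindow) -> DataWindow:
        raise NotImplementedError

class Pipeline:
    """Applies pipes to DataWindows; sequential here for simplicity, though
    the paper also describes parallel and looped execution."""
    def __init__(self, pipes: list[Pipe]):
        self.pipes = pipes

    def run(self, windows: list[DataWindow]) -> list[DataWindow]:
        for window in windows:
            for pipe in self.pipes:
                window = pipe.process(window)
        return windows
```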
The journey begins with a ‘DataWindowGenerator’ that takes a video, transcribes it using models like OpenAI’s Whisper, and segments it into coherent paragraphs. These segments, along with their aligned frames, are then packed into DataWindows.
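As a rough illustration of this step, the open-source openai-whisper package yields timestamped segments that can then be grouped into windows. The fixed ~30-second grouping heuristic below is an assumption; the paper groups the transcript into coherent paragraphs.

```python
# Illustrative transcription-and-windowing step using openai-whisper.
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("lecture.mp4")   # path is a placeholder

windows, texts, start = [], [], None
for seg in result["segments"]:
    if start is None:
        start = seg["start"]
    texts.append(seg["text"].strip())
    if seg["end"] - start >= 30:           # close the current window
        windows.append({"text": " ".join(texts),
                        "start_s": start, "end_s": seg["end"]})
        texts, start = [], None
if texts:                                  # flush the final partial window
    windows.append({"text": " ".join(texts), "start_s": start,
                    "end_s": result["segments"][-1]["end"]})
```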
The Video Processing Pipeline in Detail
Within the framework, a specific pipeline recipe is crafted to process videos:
- Keyframe Extraction: A ‘KeyFrameExtractor’ identifies representative keyframes from each video segment, adapting clustering techniques to select the clearest and most informative images (a simplified clustering-based version is sketched after this list).
- Content Recognition: Optical Character Recognition (OCR) is performed on these keyframes using ‘EasyOCR’ to detect text. Simultaneously, ‘RecognizeAnything’ (RAM) is used for image tagging to identify visible objects.
- Object Localization: The recognized objects and text are then used to prompt ‘GroundingDino’, which localizes these objects by detecting their corresponding bounding boxes.
- Dense Captioning: To provide rich descriptions, ‘HQEfficientSAM’ generates fine-grained masks for detected elements. A ‘CroppingObjectFocuser’ then crops images around these masked objects, which are subsequently captioned by a ‘Captioner’ employing models like BLIP (see the captioning sketch after this list). These captions are merged to create dense descriptions for each frame.
- Relationship Extraction: Finally, a ‘SentenceGraphParser’ processes these dense captions. It employs scene graph parsing, co-reference resolution, and a concreteness filter to extract subjects, objects, and their relationships, forming clauses that describe the video content (a toy parse is sketched after this list).
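For the keyframe and OCR steps, a simplified stand-in might cluster frames by color histogram with k-means and run EasyOCR on the frame nearest each cluster center. The histogram-based clustering criterion is an assumption; the paper adapts clustering to favor the clearest, most informative frames.

```python
# Simplified KeyFrameExtractor plus OCR: k-means over color histograms,
# then EasyOCR on the selected keyframes. The clustering criterion is an
# assumption, not the paper's exact method.
import cv2                     # pip install opencv-python
import numpy as np
import easyocr                 # pip install easyocr
from sklearn.cluster import KMeans

def frame_histogram(frame):
    """8x8x8 color histogram as a compact per-frame feature vector."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def extract_keyframes(frames, k=3):
    feats = np.array([frame_histogram(f) for f in frames])
    k = min(k, len(frames))
    km = KMeans(n_clusters=k, n_init=10).fit(feats)
    keyframes = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[idx] - km.cluster_centers_[c], axis=1)
        keyframes.append(frames[idx[np.argmin(dists)]])  # nearest to centroid
    return keyframes

reader = easyocr.Reader(["en"])
# `frames` would come from a segment's DataWindow:
# texts = [reader.readtext(kf, detail=0) for kf in extract_keyframes(frames)]
```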
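The dense-captioning step can be approximated with BLIP through Hugging Face transformers. Here, plain bounding-box crops stand in for the paper's HQEfficientSAM masks and CroppingObjectFocuser.

```python
# Caption cropped regions with BLIP and merge the captions into a dense
# frame description; box-based crops approximate the paper's mask-based crops.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def caption_crops(image: Image.Image, boxes):
    """Caption each (left, top, right, bottom) box and merge the results."""
    captions = []
    for box in boxes:
        crop = image.crop(box)
        inputs = processor(images=crop, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return ". ".join(captions)
```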
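And a toy version of the relationship-extraction step can pull (subject, verb, object) clauses from the dense captions with spaCy's dependency parse; the paper's co-reference resolution and concreteness filter are omitted here.

```python
# Toy SentenceGraphParser stand-in: subject-verb-object triples via spaCy.
import spacy  # pip install spacy; python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def extract_clauses(text: str):
    clauses = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children
                       if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    clauses.append((s.lemma_, token.lemma_, o.lemma_))
    return clauses

print(extract_clauses("A person wearing a mask holds a syringe."))
# e.g. [('person', 'hold', 'syringe')]
```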
From VideoKnowledgeBase to Queryable Knowledge Graph
The output of this pipeline, the VideoKnowledgeBase, is a semi-structured collection of detected objects, their relations, and timestamps. This is then converted into a Video Knowledge Graph. Nodes in this graph correspond to ‘Synsets’ from the WordNet lexical database, each linked to specific frames where the knowledge is observed. Nouns and verbs extracted from transcriptions, tags, and captions define these nodes, with word sense disambiguation ensuring accurate semantic representation. Nodes are interconnected based on WordNet’s hierarchical relationships (hypernyms/hyponyms), creating a rich, multi-indexed graph for the entire video.
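A rough sketch of this construction with NLTK's WordNet interface and networkx is shown below. The classic Lesk algorithm stands in for word sense disambiguation, which is an assumption; this summary does not specify the paper's WSD method.

```python
# Frame-indexed synset graph: nodes are WordNet synsets tagged with the
# frames where they occur; edges follow hypernym/hyponym links.
import networkx as nx
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')
from nltk.wsd import lesk

G = nx.DiGraph()

def add_term(graph, sentence_tokens, word, frame_ids):
    # Disambiguate the word in context (Lesk here is a stand-in method),
    # falling back to the first listed sense if Lesk finds nothing.
    syn = lesk(sentence_tokens, word) or next(iter(wn.synsets(word)), None)
    if syn is None:
        return
    graph.add_node(syn.name())
    graph.nodes[syn.name()].setdefault("frames", set()).update(frame_ids)
    for hyper in syn.hypernyms():              # link to is-a parents
        graph.add_edge(hyper.name(), syn.name(), relation="hyponym")

tokens = "a person holds a syringe".split()
add_term(G, tokens, "syringe", frame_ids={120, 121, 122})
print(G.nodes["syringe.n.01"]["frames"])       # {120, 121, 122}
```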
Enabling Continual Learning and Multimodal Queries
A key innovation is the ability to query these Video Knowledge Graphs using multimodal inputs (text, image, or video). The system converts the query into a graph format and matches it against the database of video graphs. Furthermore, the framework supports ‘VirtualSynsets’, allowing users to append new, domain-specific knowledge to the graph. For instance, a user could define “kn95 face mask” as a specific type of “face mask.” These new concepts are associated with mini-classifiers, which can be interactively trained with a small number of samples (e.g., 50 samples for a YOLOv8 model) to update the existing graphs in the background, enabling the system to adapt and learn continuously.
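Training such a mini-classifier is straightforward with the Ultralytics YOLOv8 API the paper mentions; the dataset file and image path below are placeholders for a user-supplied few-shot dataset of roughly 50 labeled samples.

```python
# Few-shot mini-classifier for a VirtualSynset such as "kn95 face mask".
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")                           # small pre-trained checkpoint
model.train(data="kn95.yaml", epochs=50, imgsz=640)  # ~50-sample dataset

# Once trained, the detector can re-scan stored keyframes in the background
# and attach hits for the new concept to the existing video graphs.
results = model.predict("keyframe_000120.jpg", conf=0.5)
```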
Future Prospects and Applications
This framework opens several avenues for future research and applications. Improvements can be made in handling OCR noise and generating more diverse captions by providing contextual information. Integrating visual and auditory cues could enhance video segmentation and word sense disambiguation. The framework also holds potential for generating datasets to train multimodal Large Language Models (LLMs) and specialized classifiers. Moreover, it can be adapted for Augmented Reality (AR) applications, providing context to socially intelligent agents and enabling more realistic and engaging interactions. For more technical details, refer to the full research paper.