Unlocking Real-Time Video Understanding with StreamingTOM

TLDR: StreamingTOM is a new, training-free framework that makes video AI models much more efficient for live streaming. It tackles two major bottlenecks: reducing the computational cost before the AI model processes video frames (Causal Temporal Reduction) and managing the AI model’s memory more effectively by compressing and retrieving data on demand (Online Quantized Memory). This results in significant memory savings (15.7x compression), faster processing, and stable performance for long videos, making real-time video understanding practical.

In the rapidly evolving world of artificial intelligence, processing live video streams efficiently has remained a significant challenge. Traditional methods for video understanding, often designed for pre-recorded content, struggle with the unique demands of real-time streaming: the inability to look into the future (causality) and the ever-growing amount of data (accumulation). These issues lead to massive computational costs and memory bottlenecks, making real-time applications like autonomous driving or live assistants difficult to implement.

A new framework called StreamingTOM (Streaming Token Compression) aims to solve these fundamental problems. Developed by researchers Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang from Westlake University, The Chinese University of Hong Kong, Zhejiang University, and SII, StreamingTOM is a training-free, plug-and-play solution designed to make video understanding models much more efficient in streaming scenarios. Full details are available in the team's research paper.

Addressing the Core Bottlenecks

StreamingTOM tackles two critical bottlenecks that existing approaches often overlook or only partially address. Firstly, it focuses on the ‘pre-LLM’ stage, which is the initial processing of visual information before it even reaches the large language model (LLM). Current methods often process all visual tokens, leading to high computational costs. Secondly, it manages the ‘post-LLM’ stage, specifically the LLM’s memory (known as the kv-cache), which tends to grow without bound as more video frames arrive.
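To make the unbounded growth concrete, here is a rough back-of-the-envelope sketch of how a kv-cache scales with stream length. The model shape, tokens per frame, and frame rate below are assumptions chosen purely for illustration; the article does not state them, and the function names are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per token, per layer, per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical settings (assumptions, not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
TOKENS_PER_FRAME = 196
FRAMES_PER_SECOND = 0.5


def cache_gib_after(seconds):
    """Uncompressed 16-bit cache size in GiB after `seconds` of video."""
    tokens = seconds * FRAMES_PER_SECOND * TOKENS_PER_FRAME
    return kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30


for minutes in (1, 10, 60):
    print(f"{minutes:>3} min -> {cache_gib_after(minutes * 60):.2f} GiB")
```

Whatever the exact settings, the cache size is linear in the number of tokens kept, so an uncompressed stream eventually exhausts any fixed memory budget.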

The framework is built on two main components:

Causal Temporal Reduction (CTR): This component works before the LLM. It’s designed to be strictly causal, meaning it only uses information from the current and past video frames, never future ones. CTR intelligently selects a fixed, compact subset of visual tokens from each frame, drastically reducing the amount of data the LLM needs to process. It does this by identifying changes between adjacent frames and focusing on the most informative parts of the current frame, ensuring predictable processing latency.
Online Quantized Memory (OQM): This component manages the LLM’s memory after the initial processing. OQM stores the processed tokens in a highly compressed 4-bit format, significantly reducing memory footprint. When the system needs to answer a question, it only retrieves and dequantizes the most relevant groups of tokens on demand. This ensures that the active memory used during decoding remains bounded, regardless of how long the video stream is.
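The two components can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: tokens are small float lists rather than real kv-cache entries, "change" is measured as Euclidean distance to the same position in the previous frame, quantization is a per-vector min/max scheme, and each stored group is indexed by its mean vector. All function and class names are hypothetical.

```python
import math


def select_tokens(prev_frame, cur_frame, budget):
    """Causal selection: keep the `budget` tokens that changed most since the previous frame."""
    if prev_frame is None:
        # No past frame yet: use the vector norm as a stand-in saliency score (assumption).
        scores = [math.sqrt(sum(x * x for x in tok)) for tok in cur_frame]
    else:
        scores = [math.dist(p, c) for p, c in zip(prev_frame, cur_frame)]
    ranked = sorted(range(len(cur_frame)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore the original token order
    return [cur_frame[i] for i in keep]


def quantize_4bit(vec):
    """Uniform 4-bit quantization: integer codes in 0..15 plus the offset and scale."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant vectors
    return [round((x - lo) / scale) for x in vec], lo, scale


def dequantize_4bit(codes, lo, scale):
    """Invert quantize_4bit, up to rounding error."""
    return [lo + c * scale for c in codes]


class QuantizedMemory:
    """Stores one group of quantized tokens per frame and retrieves groups on demand."""

    def __init__(self):
        self.groups = []  # each entry: (full-precision mean vector, quantized tokens)

    def append(self, tokens):
        dim = len(tokens[0])
        summary = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        self.groups.append((summary, [quantize_4bit(t) for t in tokens]))

    def retrieve(self, query, top_k):
        """Dequantize only the `top_k` groups whose summary best matches the query."""
        def relevance(group):
            return sum(q * s for q, s in zip(query, group[0]))

        best = sorted(self.groups, key=relevance, reverse=True)[:top_k]
        return [[dequantize_4bit(*q) for q in quantized] for _, quantized in best]


# Toy stream: three frames of three 2-dimensional tokens each.
stream = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [1.5, 0.5], [2.0, 2.1]],
    [[3.0, 0.0], [1.5, 0.5], [2.0, 2.1]],
]
memory = QuantizedMemory()
prev = None
for frame in stream:
    memory.append(select_tokens(prev, frame, budget=2))  # fixed per-frame budget
    prev = frame

recalled = memory.retrieve(query=[1.0, 0.0], top_k=1)
print(len(memory.groups), "groups stored;", len(recalled), "group dequantized")
```

The selection step only ever looks at the current and previous frame, and the budget is fixed, so per-frame cost stays constant. At query time the amount of dequantized data depends on `top_k`, not on how many frames have been stored.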

Impressive Results and Practical Benefits

Together, CTR and OQM deliver substantial efficiency gains. StreamingTOM achieves a 15.7x kv-cache compression ratio. Compared to previous state-of-the-art training-free methods, it uses 1.2x less peak memory and halves the ‘Time To First Token’ (TTFT), the delay before the model starts generating a response. For instance, a one-hour video stream that would typically require 18.8 GB of kv-cache memory needs just 1.2 GB with StreamingTOM, which illustrates how the framework keeps memory growth in check over extended sessions.
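The headline ratio follows directly from the one-hour figures quoted above, as this short check shows. The final step is an assumption rather than something the article states: it supposes the uncompressed cache uses 16-bit values and ignores quantization metadata, so that moving to 4 bits accounts for a factor of 4.

```python
baseline_gb = 18.8    # one-hour stream, uncompressed kv-cache (from the article)
compressed_gb = 1.2   # same stream with StreamingTOM (from the article)

ratio = baseline_gb / compressed_gb
saved_gb = baseline_gb - compressed_gb
print(f"compression ratio: {ratio:.1f}x")   # 15.7x
print(f"memory saved: {saved_gb:.1f} GB")   # 17.6 GB

# Assumption: a 16-bit baseline stored at 4 bits contributes a 4x reduction,
# leaving the remainder to come from keeping fewer tokens per frame.
quantization_factor = 16 / 4
implied_token_reduction = ratio / quantization_factor
print(f"implied token reduction: about {implied_token_reduction:.1f}x")
```

Under that assumption, the two components would contribute in roughly equal measure to the overall saving.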

Crucially, these efficiency improvements do not come at the cost of accuracy. StreamingTOM maintains state-of-the-art accuracy among training-free methods, achieving an average of 63.8% on offline benchmarks and 55.8% accuracy with a 3.7 score on RVS streaming benchmarks. This demonstrates its robust capability across various temporal scales and real-time scenarios.

A Step Towards Real-Time Video AI

By addressing both the computational burden before the LLM and the memory accumulation after it, StreamingTOM offers a unified, training-free solution for efficient streaming video understanding. Its predictable latency and bounded memory growth make it a practical framework for deploying video LLMs in real-time, long-duration applications, paving the way for more responsive and capable AI systems in areas like autonomous driving, embodied AI, and live video assistants.
