TLDR: video-SALMONN S is a new streaming audio-visual large language model (LLM) designed to understand extremely long videos (over 3 hours) efficiently. It achieves this by using a novel test-time training (TTT) memory module that continuously updates token representations to retain long-term context, and a prompt-dependent memory reader that selectively retrieves relevant information. This allows it to process vast amounts of video data under a fixed memory budget, outperforming existing offline and streaming video understanding models.
AI agents are increasingly expected to process long video streams continuously, at high frame rates and resolutions. Current video-understanding large language models (LLMs), however, struggle with extended content. Many sample a fixed number of frames, which loses substantial information on longer videos, or rely on token compression, which can discard crucial details.
To address these limitations, researchers have introduced video-SALMONN S, a streaming audio-visual LLM designed to overcome the length constraints of previous models. It processes videos that span multiple hours while operating within a fixed memory budget.
A New Approach to Long-Term Memory
video-SALMONN S rests on two innovations that strengthen its long-term memory. The first is a **Test-Time Training (TTT) memory module**. Unlike methods that merge or discard tokens and lose information in the process, the TTT module continually updates token representations, which lets the model capture and retain long-range dependencies throughout the video stream. To make this adaptation efficient, the module uses a Hessian-free conjugate-gradient procedure (TTT_HF).
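To give a feel for what "Hessian-free conjugate gradient" means, here is a minimal, generic sketch. It is not the paper's TTT_HF implementation: the toy least-squares loss, the function names, and the regularizer `lam` are all assumptions chosen for illustration. The point is that conjugate gradient only ever needs Hessian-vector products, so the Hessian matrix itself is never built.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(xs, ys, w, lam):
    # Toy objective (an assumption): 0.5*sum((w.x - y)^2) + 0.5*lam*|w|^2
    return 0.5 * sum((dot(x, w) - y) ** 2 for x, y in zip(xs, ys)) + 0.5 * lam * dot(w, w)

def grad(xs, ys, w, lam):
    g = [lam * wi for wi in w]
    for x, y in zip(xs, ys):
        r = dot(x, w) - y
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g

def hvp(xs, v, lam):
    # Hessian-vector product H v = sum_i x_i (x_i . v) + lam * v,
    # computed without forming the Hessian matrix.
    out = [lam * vi for vi in v]
    for x in xs:
        s = dot(x, v)
        for j, xj in enumerate(x):
            out[j] += s * xj
    return out

def cg_update(xs, ys, w, lam=1e-3, iters=10, tol=1e-12):
    # Approximately solve H d = -g with conjugate gradient, then return w + d.
    g = grad(xs, ys, w, lam)
    d = [0.0] * len(w)
    r = [-gi for gi in g]  # residual of H d = -g at d = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        hp = hvp(xs, p, lam)
        alpha = rs / dot(p, hp)
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return [wi + di for wi, di in zip(w, d)]

# Tiny example: the targets are generated by w = [2, -1].
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
w0 = [0.0, 0.0]
w1 = cg_update(xs, ys, w0)
```

On this two-dimensional toy problem the update lands close to `[2, -1]` after two conjugate-gradient steps. A real test-time-training layer would apply the same idea to a neural memory's parameters, with Hessian-vector products obtained through automatic differentiation.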
The second is a **prompt-dependent memory reader**. Given the user’s prompt or query, it retrieves only the context-relevant content from the fixed-size memory. The model therefore does not need to process all stored information at once, which improves both efficiency and relevance when the memory holds the history of a long video.
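The idea of prompt-dependent reading can be sketched as a simple top-k retrieval. This is only an illustration under stated assumptions: the article does not describe the reader's actual scoring rule, so cosine similarity, the function name `read_memory`, and the parameter `k` are hypothetical stand-ins.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def read_memory(memory, prompt_vec, k):
    # Rank memory slots by similarity to the prompt embedding (assumed scoring rule),
    # keep the k best, and return their indices in temporal order.
    ranked = sorted(range(len(memory)),
                    key=lambda i: cosine(memory[i], prompt_vec),
                    reverse=True)
    return sorted(ranked[:k])

# Toy memory of four 2-D token embeddings and a prompt pointing along the first axis.
memory = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = read_memory(memory, [1.0, 0.0], k=2)
print(selected)  # [0, 2]
```

Only the selected slots would then be passed to the LLM, so the cost of answering a query depends on `k` rather than on how much video has been seen.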
Unprecedented Capabilities
video-SALMONN S is the first audio-visual LLM known to process videos exceeding three hours in length at 360p resolution and 1 frame per second (FPS), all within a fixed memory footprint. In practice, this means understanding multi-hour videos with over 10,000 frames and approximately 1 million tokens, a scale that earlier systems struggled to reach.
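As a rough sanity check on these figures, the frame count follows directly from the duration and frame rate. The per-frame token estimate below is a back-of-the-envelope derivation from the article's rounded numbers, not a value reported in the source, and it ignores how tokens are split between audio and video.

```latex
3\,\text{h} \times 3600\,\tfrac{\text{s}}{\text{h}} \times 1\,\tfrac{\text{frame}}{\text{s}} = 10{,}800\ \text{frames},
\qquad
\frac{10^{6}\ \text{tokens}}{10{,}800\ \text{frames}} \approx 93\ \text{tokens per frame}
```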
The model was evaluated on several long-video benchmarks: Video-MME, LVBench, and VideoEvalPro. On these tests, video-SALMONN S sustained high-quality understanding on long videos. Its 8-billion-parameter version achieved 74.2% overall accuracy on Video-MME and 67.8% on the benchmark’s challenging long-video partition, outperforming both offline and streaming baseline models.
How it Works (Simplified)
In essence, incoming video frames are first converted into encodings. These encodings pass through the TTT_HF layer, which integrates historical information into their representations. A fixed-size long-term memory is maintained by discarding tokens that are highly similar to their neighbors; the information carried by the discarded tokens is still preserved by the TTT_HF layer. Audio information is processed separately and then combined. When a user provides a prompt, the prompt-dependent reader sifts through this memory and selects only the most pertinent content to generate a response.
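The fixed-size memory step can be illustrated with a short sketch. It covers only the discarding of near-duplicate tokens, not the TTT layer that preserves their information, and the details are assumptions: cosine similarity to the previous neighbor as the redundancy measure, and the hypothetical helper `append_with_budget`.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def append_with_budget(memory, token, budget):
    # Add a new token; while over budget, drop the token that is most
    # similar to its preceding neighbor (assumed redundancy criterion).
    if budget < 1:
        raise ValueError("budget must be at least 1")
    memory.append(token)
    while len(memory) > budget:
        sims = [cosine(memory[i - 1], memory[i]) for i in range(1, len(memory))]
        drop = 1 + max(range(len(sims)), key=sims.__getitem__)
        del memory[drop]
    return memory

# Stream five toy token embeddings through a memory that holds at most three.
stream = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
memory = []
for tok in stream:
    append_with_budget(memory, tok, budget=3)
print(memory)  # [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
```

The memory never grows past the budget regardless of stream length, which is what makes processing a video of unknown duration feasible. The two near-duplicate tokens are the ones removed, while the three distinct directions survive.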
This innovative design allows video-SALMONN S to handle the continuous, real-time nature of video streams where the total length is unknown, a scenario where traditional offline LLMs with limited attention spans often fall short. By continually updating its memory and selectively retrieving information, video-SALMONN S mitigates the significant information loss typically associated with processing extremely long videos.
For more in-depth information, you can refer to the full research paper: video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory.


