VideoNSA: A Smart Approach to Scaling Video Understanding in AI Models

TLDR: VideoNSA is a new method that uses Native Sparse Attention (NSA) to significantly improve how AI models understand long videos. It combines three attention mechanisms (compression, selection, and sliding window) with dynamic gating to efficiently process video data, scaling up to 128K tokens while using only 3.6% of the full attention budget. This allows it to outperform existing methods in long-video understanding, temporal reasoning, and spatial understanding, effectively managing computational complexity and attention sinks.

Long-video understanding remains a significant challenge for advanced AI models, particularly multimodal large language models (MLLMs). These models often struggle with the sheer volume of information, which leads to missed transition frames and lost coherence over extended periods. Traditional approaches, such as simply sampling more frames, cause computational cost to explode and quickly hit the limits of a model’s context length. Other methods, like token compression, reduce redundancy but can cause irreversible information loss, especially in complex reasoning tasks.

A new research paper, "VideoNSA: Native Sparse Attention Scales Video Understanding," introduces a solution to this problem. This work, by Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, and Zhuowen Tu, adapts a mechanism known as Native Sparse Attention (NSA) specifically for video-language models. The core idea behind VideoNSA is to focus the model’s attention on the most relevant parts of a video, rather than processing every piece of information uniformly.

How VideoNSA Works: A Hybrid Approach

VideoNSA employs a clever hybrid attention mechanism. For text inputs, it maintains a standard, dense attention approach to ensure precise instruction following. However, for video inputs, it leverages Native Sparse Attention. This sparse attention mechanism is ‘hardware-aware’ and ‘learnable,’ meaning it’s designed to be efficient on computing hardware and can adapt its focus during training. Instead of computing attention between all possible key-value pairs, NSA dynamically builds a smaller, information-rich subset of data for each query.

This dynamic selection is achieved through three complementary branches, each with a specific role, and a learnable ‘gate’ that adaptively weights their contributions:

Compression (CMP) Branch: This branch aggregates blocks of video frames into more concise, block-level representations. Think of it like summarizing sections of a video to reduce redundancy while keeping the main points.
Selection (SLC) Branch: This branch identifies and preserves the most important or ‘salient’ key-value blocks. It computes importance scores and picks out the top-ranked blocks, ensuring that critical moments are not overlooked.
Sliding Window (SWA) Branch: This branch focuses on local temporal coverage, similar to how humans might pay close attention to recent events. It retains a fixed number of the most recent key-value pairs, ensuring that immediate context is always considered.
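
To make the three branches and the gate concrete, here is a minimal sketch for a single query vector, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: mean pooling stands in for the learned compression, the gate values are passed in as fixed numbers rather than predicted by the model, the query is assumed to sit at the last position (so every key is in its past), and all function names and default sizes (block, top_k, window) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def sparse_attention(q, keys, values, gates, block=4, top_k=2, window=4):
    """Combine compression, selection, and sliding-window outputs with gates.

    `gates` is a (g_cmp, g_slc, g_swa) tuple; in a trained model these
    would be predicted per query, here they are fixed inputs (assumption).
    """
    starts = range(0, len(keys), block)
    k_blocks = [keys[s:s + block] for s in starts]
    v_blocks = [values[s:s + block] for s in starts]

    # Compression branch: one pooled key/value per block
    # (mean pooling is a stand-in for a learned compressor).
    cmp_k = [mean_pool(b) for b in k_blocks]
    cmp_v = [mean_pool(b) for b in v_blocks]
    out_cmp = attend(q, cmp_k, cmp_v)

    # Selection branch: score each block by its pooled key, then attend
    # to the top_k highest-scoring blocks at full resolution.
    scores = [dot(q, k) for k in cmp_k]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    sel_k = [k for i in sorted(chosen) for k in k_blocks[i]]
    sel_v = [v for i in sorted(chosen) for v in v_blocks[i]]
    out_slc = attend(q, sel_k, sel_v)

    # Sliding-window branch: only the most recent `window` key-value pairs.
    out_swa = attend(q, keys[-window:], values[-window:])

    g_cmp, g_slc, g_swa = gates
    out = [g_cmp * a + g_slc * b + g_swa * c
           for a, b, c in zip(out_cmp, out_slc, out_swa)]
    return out, chosen

# Toy usage: 8 key-value pairs of dimension 2, equal gate weights.
keys = [[float(i), 1.0] for i in range(8)]
values = [[float(i), 0.0] for i in range(8)]
output, picked_blocks = sparse_attention([1.0, 0.0], keys, values, gates=(1/3, 1/3, 1/3))
```

Setting the gates to (0, 0, 1) reduces the sketch to pure sliding-window attention, which is a quick way to see how the gate shifts weight between global and local context.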

These three branches work together, with the learnable gate deciding how much weight each one receives. VideoNSA builds upon the Qwen2.5-VL-7B model, using Qwen2.5-7B as its language model decoder. That decoder uses Grouped-Query Attention (GQA), which handles the text tokens efficiently.

Training and Performance

VideoNSA was trained end-to-end on a substantial dataset of 216,000 video instruction pairs, a filtered subset of LLaVA-Video-178K. While trained with a maximum context length of 36,000 tokens, the model demonstrated remarkable scalability, effectively handling contexts up to 128,000 tokens.

The experimental results are compelling. VideoNSA consistently outperformed existing token compression and training-free sparse attention methods across a range of benchmarks. It showed improved performance in:

Long Video Understanding: Evaluated on benchmarks like LongVideoBench, MLVU, TimeScope, and LongTimeScope, VideoNSA achieved competitive results, especially on ultra-long videos, some spanning up to 10 hours.
Temporal Reasoning: On the Tomato benchmark, which assesses various reasoning types and video scenarios, VideoNSA achieved the highest accuracy, highlighting its ability for fine-grained temporal inference.
Spatial Understanding: In VSIBench, which focuses on spatial reasoning, VideoNSA matched the strongest sparse attention baselines and significantly surpassed token compression methods, confirming its ability to preserve spatial detail.

Key Insights from Scaling and Analysis

The researchers conducted extensive analysis, revealing several important findings:

Benefit of Learned Sparse Weights: Even when applied in dense attention settings, the learned weights from VideoNSA provided a beneficial ‘inductive bias,’ improving performance on several tasks. This suggests that the model learns effective attention distributions.
Context Length Scalability: VideoNSA effectively extrapolates to contexts far beyond its training length, scaling reliably to 128,000 tokens. However, the ideal balance between tokens per frame and total frames is highly task-dependent. For instance, LongVideoBench benefits from more tokens per frame, while TimeScope and Tomato prefer more frames for better temporal coverage.
Optimal Attention Budget Allocation: The model’s performance is highly sensitive to how the attention budget is allocated. Configurations close to the training settings generally yield the best results. Interestingly, increasing ‘global’ attention (more blocks) tends to be more effective than simply enlarging the ‘local’ sliding window. Remarkably, VideoNSA achieves leading performance using only 3.6% of the full attention budget.
Dynamic Roles of Branches: Each of the three attention branches plays a distinct role across different layers of the model. The compression branch generally maintains high importance, crucial for redundancy reduction. The selection and sliding window branches are more active in early and middle layers but diminish in later layers as the model focuses on aggregating high-level features.
Efficiency Bottleneck: The compression branch, while vital, was identified as the primary computational bottleneck as the context length grows, indicating an area for future optimization.
Managing Attention Sinks: Attention sinks are a common issue in transformers where some tokens disproportionately absorb attention. VideoNSA’s dynamic gating mechanism effectively counteracts the negative effects of the compression branch (which tends to produce more sinks), maintaining a low overall sink ratio of 0.3%. This leads to smoother temporal coverage and avoids over-reliance on early positions, a common problem in dense attention models.
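
The attention-budget idea discussed above can be illustrated with simple arithmetic. The sketch below counts how many key-value entries one query touches across the three branches and divides by the number a dense causal query at the same position would touch. The accounting and every parameter value are assumptions made for illustration; the paper's own budget definition and settings may differ, so this is not meant to reproduce the reported 3.6% figure.

```python
def attention_budget_fraction(context_len, block, stride, top_k, window):
    """Fraction of the dense causal budget used by the last query position.

    Hypothetical accounting: one entry per compressed block, `block`
    entries per selected block, and `window` entries for the local branch.
    """
    n_compressed = max(0, (context_len - block) // stride + 1)
    n_selected = min(top_k * block, context_len)
    n_window = min(window, context_len)
    sparse = n_compressed + n_selected + n_window
    return min(1.0, sparse / context_len)

# Illustrative settings only (not taken from the paper).
for n in (8_000, 32_000, 128_000):
    frac = attention_budget_fraction(n, block=64, stride=64, top_k=32, window=256)
    print(f"{n:>7} tokens: {frac:.2%} of dense attention")
```

Under this accounting the selected blocks and the window cost a fixed amount, so their share shrinks as the context grows, while the compressed-block count keeps growing with context length. That matches the observation above that the compression branch becomes the main bottleneck at long contexts.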

Conclusion

VideoNSA represents a significant step forward in video understanding for multimodal language models. By combining block-wise compression, salient block selection, and a sliding window mechanism through learnable gates, it efficiently processes ultra-long video contexts while preserving critical information. The research demonstrates that this hybrid sparse attention approach offers a powerful and scalable framework, paving the way for more capable video foundation models. To learn more, see the full research paper.
