TLDR: K-frames is a new method for understanding long videos by selecting important ‘key clips’ instead of individual frames. Developed by Yifeng Yao et al., it addresses limitations of current Multimodal Large Language Models (MLLMs) like context window constraints and computational costs. The system uses a new dataset called PeakClips, built through scene segmentation, hierarchical captioning, and LLM-guided relevance scoring. K-frames is trained via a three-stage curriculum involving supervised fine-tuning and reinforcement learning to predict query-relevant clips and enable flexible ‘any-k’ keyframe selection. This approach preserves temporal continuity, offers interpretability, and significantly improves MLLM performance on various video understanding benchmarks.
Large Language Models (LLMs) that can understand both text and images, known as Multimodal Large Language Models (MLLMs), have made incredible strides in interpreting visual information. However, when it comes to long videos, these powerful models face significant hurdles. Imagine trying to understand a feature-length film by looking at every single frame – it’s computationally expensive, and the sheer volume of data can overwhelm the model’s ‘context window,’ which is like its short-term memory.
The common approach of simply picking frames at regular intervals, called uniform sampling, often misses crucial information. It’s like trying to understand a story by reading only every tenth word; you’ll likely lose the plot. Other existing methods for selecting important frames, such as those based on text searches or complex optimization techniques, tend to pick frames that are scattered and don’t maintain the natural flow of events in a video. They also lack the flexibility to choose a varying number of frames, which is important for different tasks or computational budgets.
Introducing K-frames: A New Approach to Video Understanding
To tackle these challenges, researchers Yifeng Yao, Yike Yun, Jing Wang, Huishuai Zhang, Dongyan Zhao, Ke Tian, Zhihao Wang, Minghui Qiu, and Tao Wang have introduced a novel system called K-frames. This new method redefines how keyframes are selected for long video understanding. Instead of picking isolated frames, K-frames focuses on identifying ‘key clips’ – segments of video that are semantically meaningful and maintain temporal continuity. This ‘clip-first’ approach ensures that the narrative flow of events is preserved, making the selection process more interpretable and effective.
One of the standout features of K-frames is its ability to perform ‘any-k’ keyframe selection. This means it can flexibly select any desired number of keyframes to suit different user needs or computational constraints, a significant improvement over methods that are limited to a fixed number of frames.
Building the Foundation: The PeakClips Dataset
A major obstacle in developing scene-driven keyframe selection has been the lack of datasets with detailed scene-level relevance annotations. To overcome this, the team constructed a new, large-scale dataset called PeakClips. This dataset contains over 200,000 query-conditioned relevance annotations on video clips, providing the necessary information for K-frames to learn effectively.
The creation of PeakClips involved a three-stage process (a simplified code sketch follows the list):
- Scene Segmentation: Videos were first broken down into distinct, coherent scenes based on changes in visual content.
- Hierarchical Captioning: Detailed descriptions were generated for clips, chapters (groups of related clips), and the entire video, offering multi-level context.
- LLM-guided Relevance Scoring: An advanced LLM (Gemini 2.5 Pro) was used to assign relevance scores to each scene based on a given query, further refined by comparing frame-query similarity. This process identified ‘top-priority’ (P1) and ‘secondary-priority’ (P2) highlight clips.
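To make this pipeline concrete, here is a minimal sketch of how such a query-conditioned annotation pass could be wired together. Everything in it — the injected helper functions (captioner, LLM scorer, frame-query similarity), the 50/50 score weighting, and the P1/P2 cutoffs — is an illustrative assumption, not the authors' actual implementation.

```python
# A simplified, hypothetical sketch of a PeakClips-style annotation pass.
# The injected callables and the thresholds are illustrative, not the paper's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Clip:
    start: float                 # seconds
    end: float
    caption: str = ""
    relevance: float = 0.0
    priority: str = "none"       # "P1" (top), "P2" (secondary), or "none"

def annotate_video(
    scenes: list[tuple[float, float]],              # output of a scene-segmentation step
    caption_fn: Callable[[float, float], str],      # hypothetical clip captioner
    llm_score_fn: Callable[[str, str], float],      # hypothetical LLM relevance scorer (0-1)
    sim_fn: Callable[[float, float, str], float],   # hypothetical frame-query similarity (0-1)
    query: str,
) -> list[Clip]:
    clips = [Clip(start, end) for start, end in scenes]
    for clip in clips:
        # Hierarchical captioning (only the clip level is shown here).
        clip.caption = caption_fn(clip.start, clip.end)
        # LLM-guided relevance, refined by frame-query similarity.
        clip.relevance = (0.5 * llm_score_fn(query, clip.caption)
                          + 0.5 * sim_fn(clip.start, clip.end, query))
        # Illustrative cutoffs for top-priority (P1) and secondary (P2) clips.
        clip.priority = ("P1" if clip.relevance > 0.8
                         else "P2" if clip.relevance > 0.5
                         else "none")
    return clips
```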
How K-frames Learns: A Three-Stage Curriculum
K-frames is trained using a progressive three-stage curriculum, building its capabilities step by step:
- Stage 1 (Supervised Fine-Tuning – SFT): The model learns foundational skills like temporal grounding (aligning visual content with its time span) and scene understanding using the hierarchical captions and relevance annotations from PeakClips. It learns to locate scenes from descriptions and generate descriptions for given time spans.
- Stage 2 (Supervised Fine-Tuning – SFT): Building on the first stage, the model learns to perceive query-relevant video clips and provide reasons for their selection. This stage enables the core ‘clip2frame’ prediction (see the sketch after this list).
- Stage 3 (Reinforcement Learning – RL): The SFT-trained model is then optimized using Reinforcement Learning. This stage directly aligns the clip selection policy with the performance on downstream tasks, such as answering questions about the video, without needing further manual annotations. This ensures the selected clips are maximally effective for real-world applications.
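The ‘clip2frame’ step is what turns predicted clips into an arbitrary number of keyframes. As a rough illustration of how any-k selection could work (reusing the Clip dataclass from the earlier sketch), one can spread the frame budget across the highest-scoring clips and sample evenly inside each; the proportional-allocation rule below is an assumption for illustration, not the paper's exact procedure.

```python
# Hedged sketch of an "any-k" clip-to-frame step: distribute a frame budget k
# across predicted clips in proportion to their relevance, then sample evenly
# inside each clip so temporal continuity within a scene is preserved.
# The allocation rule is illustrative, not the paper's exact procedure.

def clips_to_keyframes(clips: list[Clip], k: int) -> list[float]:
    """Return up to k timestamps (in seconds) drawn from the selected clips."""
    ranked = sorted(clips, key=lambda c: c.relevance, reverse=True)
    total = sum(c.relevance for c in ranked) or 1.0
    timestamps: list[float] = []
    for clip in ranked:
        # Proportional budget, with at least one frame per selected clip;
        # rounding may leave the final count slightly under k.
        budget = max(1, round(k * clip.relevance / total))
        step = (clip.end - clip.start) / (budget + 1)
        timestamps += [clip.start + step * (i + 1) for i in range(budget)]
        if len(timestamps) >= k:
            break
    return sorted(timestamps[:k])
```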
Demonstrated Effectiveness and Flexibility
Extensive experiments on major long-video understanding benchmarks have shown that K-frames significantly boosts the performance of various MLLMs, including both open-source models like Qwen2.5-VL and closed-source models like Gemini 2.5 Pro and GPT-4o. For instance, it dramatically improved accuracy on tasks requiring precise temporal localization, such as Needle QA, by up to 28.2% with Gemini 2.5 Pro.
The method is also ‘plug-and-play,’ meaning it can enhance existing MLLMs without requiring modifications to their architecture. Its ability to select any number of keyframes (any-k) and provide interpretable rationales for clip selection makes it a versatile and powerful solution for understanding long videos. Whether it’s focusing intensely on critical moments or maintaining a broader context, K-frames offers flexible sampling strategies to meet diverse needs.
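Because the selector sits entirely in front of the downstream model, integration can be as simple as a pre-processing call. The sketch below shows only that shape; select_keyframes and mllm_answer are placeholder interfaces, not a published API.

```python
# Hypothetical plug-and-play wiring: a K-frames-style selector runs first, and
# only the chosen frames reach the (unchanged) MLLM. Both callables below are
# placeholders standing in for whatever selector and MLLM client you use.
from typing import Callable

def answer_with_keyframes(
    video_path: str,
    question: str,
    select_keyframes: Callable[[str, str, int], list[float]],   # query-aware any-k selector
    mllm_answer: Callable[[str, list[float], str], str],        # frame-conditioned MLLM call
    k: int = 32,
) -> str:
    # 1) Pick k timestamps relevant to the question.
    timestamps = select_keyframes(video_path, question, k)
    # 2) The MLLM answers from those frames alone, keeping the prompt
    #    comfortably inside its context window.
    return mllm_answer(video_path, timestamps, question)
```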
In conclusion, K-frames represents a significant advancement in long-video understanding by shifting from individual frame selection to scene-driven clip prediction. This innovative paradigm, supported by the new PeakClips dataset and a robust training framework, offers an effective, interpretable, and adaptable solution for making sense of lengthy video content. You can read the full research paper here: K-frames: Scene-Driven Any-K Keyframe Selection for Long Video Understanding.


