TLDR: A new research paper introduces HIPPO-VIDEO, a dataset created using an LLM-based simulator to generate realistic user watch histories for personalized video highlighting. The paper also proposes HiPHer, a method that leverages these histories to predict user-specific video highlights, outperforming traditional generic or query-based approaches by better capturing complex user preferences.
In today’s digital age, the sheer volume of video content available is overwhelming. From educational tutorials to entertainment, users are constantly searching for relevant information. However, what one person finds important in a video might be completely different from another’s preference. This highlights a critical need for personalized video highlighting, a task that aims to identify and present the most relevant segments of a video tailored to an individual user’s interests.
Traditional video summarization and highlight detection methods often fall short in this regard. They typically rely on generic approaches or simple text queries, which fail to capture the complex and evolving nature of human preferences. Imagine trying to summarize a long documentary for someone interested in historical facts versus someone focused on cinematic techniques – a one-size-fits-all approach simply doesn’t work.
To address this challenge, researchers Jeongeun Lee, Youngjae Yu, and Dongha Lee from Yonsei University have introduced a groundbreaking new dataset called HIPPO-VIDEO. This dataset is designed specifically for personalized video highlighting and was created using an innovative approach: an AI-powered user simulator based on Large Language Models (LLMs). This simulator generates realistic ‘watch histories’ that reflect diverse user preferences, overcoming the privacy concerns and resource limitations associated with collecting real user data.
The HIPPO-VIDEO dataset is substantial, comprising 2,040 pairs of (watch history, saliency score), encompassing a total of 20,400 videos across 170 semantic categories. Each watch history consists of 10 videos, providing a rich context for understanding user interests. The LLM-based simulator mimics real user behavior by iteratively updating preferences as it ‘watches’ videos. This process involves initializing user profiles with specific topics and intents, retrieving video candidates (either related videos or new search queries), engaging with videos by selecting the most and least preferred ones, and then dynamically updating its long-term preferences based on these interactions.
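As a rough illustration of the loop described above, here is a minimal sketch in which a simple tag-weight dictionary stands in for the LLM. Every name in it (`SimulatedUser`, `retrieve_candidates`, `step`, `simulate`, the tag-based scoring, and the update weights) is an assumption made for this example and is not taken from the paper, where candidate retrieval, selection, and preference updates are carried out by the LLM-based simulator.

```python
from dataclasses import dataclass, field

HISTORY_LENGTH = 10  # the article states each watch history has 10 videos


@dataclass
class SimulatedUser:
    topic: str                                        # initial topic of the profile
    intent: str                                       # initial intent of the profile
    preferences: dict = field(default_factory=dict)   # tag -> weight (long-term preferences)
    history: list = field(default_factory=list)       # ids of watched videos, in order


def retrieve_candidates(user, catalog):
    """Hypothetical retrieval step: every video the user has not watched yet."""
    seen = set(user.history)
    return [video for video in catalog if video["id"] not in seen]


def preference_score(user, video):
    """Toy stand-in for the LLM's judgment: sum of the weights of the video's tags."""
    return sum(user.preferences.get(tag, 0.0) for tag in video["tags"])


def step(user, catalog):
    """One iteration: retrieve, pick most/least preferred, update preferences."""
    candidates = retrieve_candidates(user, catalog)
    ranked = sorted(candidates, key=lambda v: preference_score(user, v), reverse=True)
    most, least = ranked[0], ranked[-1]
    for tag in most["tags"]:    # reinforce tags of the preferred video (assumed weight)
        user.preferences[tag] = user.preferences.get(tag, 0.0) + 1.0
    for tag in least["tags"]:   # dampen tags of the least preferred video (assumed weight)
        user.preferences[tag] = user.preferences.get(tag, 0.0) - 0.5
    user.history.append(most["id"])
    return most


def simulate(user, catalog, length=HISTORY_LENGTH):
    """Build a watch history by repeating the step `length` times."""
    for _ in range(length):
        step(user, catalog)
    return user.history
```

With a toy catalog of at least 10 videos, `simulate` returns a history of 10 distinct video ids. The dataset size follows from the same constant: 2,040 histories of 10 videos each give the 20,400 videos quoted above.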
After simulating a watch history, the last video in the sequence becomes the ‘target video’ for saliency annotation. The simulator then assigns a relevance score from 1 to 10 to each segment of this target video. These scores are determined by integrating the simulator’s final long-term preferences and its personal reviews of the video, ensuring that the highlights are truly aligned with the inferred user interests.
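To make the annotation format concrete, the sketch below takes the last video of a history as the target, checks that segment scores fall in the 1 to 10 range, and picks the highest-scoring segments as highlights. The function names and the top-k selection rule are assumptions for illustration; the article does not say how highlights are derived from the scores.

```python
def target_video(history):
    """The last video in a watch history is the one annotated for saliency."""
    if not history:
        raise ValueError("watch history is empty")
    return history[-1]


def validate_scores(scores):
    """Check that every segment score is an integer between 1 and 10."""
    for s in scores:
        if not (isinstance(s, int) and 1 <= s <= 10):
            raise ValueError(f"score out of range: {s!r}")
    return scores


def top_segments(scores, k=3):
    """Hypothetical highlight rule: indices of the k highest-scoring segments,
    returned in playback order. Ties go to the earlier segment."""
    validate_scores(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For example, with made-up scores `[2, 9, 4, 9, 1, 7]`, `top_segments(scores, k=3)` selects segments 1, 3, and 5.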
To validate the realism and reliability of HIPPO-VIDEO, the researchers conducted human verification studies. Human annotators assessed the plausibility of the simulator’s generated queries and video selections: 97.56% of the queries were deemed reasonable, and the simulator’s video choices matched human selections in over 71% of cases. In a further test, advanced AI models like GPT-4 were asked to tell simulated watch histories from real ones and reached only 40% accuracy in this binary classification, which is below a random baseline. In other words, the simulated histories were often indistinguishable from real ones. This validation supports the dataset’s potential as a reliable proxy for real-world user behavior.
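The two percentages above are both simple match rates. The sketch below shows the arithmetic on made-up labels chosen only so the numbers are easy to follow; they are not data from the paper. The 0.5 chance level assumes the real and simulated classes are balanced, which the article does not state.

```python
def match_rate(predicted, reference):
    """Fraction of positions where two equal-length label lists agree."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


# Assumed chance level for a balanced real-vs-simulated classification task.
CHANCE_LEVEL = 0.5

# Toy example: 10 histories, of which the classifier labels 4 correctly.
truth = ["real", "sim"] * 5
guess = ["real", "sim", "real", "sim",      # 4 correct
         "real", "sim", "real", "sim",      # all wrong (flipped relative to truth)
         "real", "sim"]
guess[4:] = ["sim", "real", "sim", "real", "sim", "real"]

toy_accuracy = match_rate(guess, truth)
below_chance = toy_accuracy < CHANCE_LEVEL
```

The same `match_rate` function applies to the human agreement figure: compare the simulator's chosen video with the annotator's choice for each case and average.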
Alongside the dataset, the researchers also propose a method called HiPHer (History-Driven Preference-Aware Video Highlighter). HiPHer leverages these personalized watch histories to predict segment-wise saliency scores. By deriving a global preference embedding from the watch history and using cross-attention to guide segment representations, HiPHer significantly outperforms existing generic and query-based approaches in experiments. This demonstrates the power of incorporating detailed user histories for more effective and user-centric video highlighting in practical scenarios.
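The idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not HiPHer itself: mean pooling for the global preference embedding, attention without learned projections, the residual addition, and the sigmoid scoring head are all choices made for this example, and the embeddings are assumed to come from some upstream video encoder.

```python
import math


def mean_pool(vectors):
    """Assumed pooling: average the history embeddings into one preference vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over the keys and values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def predict_saliency(segments, history):
    """Score each segment of the target video in (0, 1) given history embeddings."""
    preference = mean_pool(history)              # global preference embedding
    scores = []
    for seg in segments:
        context = cross_attend(seg, history, history)   # segment attends to history
        guided = [s + c for s, c in zip(seg, context)]   # preference-guided segment
        scores.append(1.0 / (1.0 + math.exp(-dot(guided, preference))))
    return scores
```

With toy two-dimensional embeddings, a segment pointing in the same direction as the history embeddings scores higher than one pointing the opposite way, which is the behavior a history-driven highlighter is meant to capture.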
The findings from this research emphasize the critical role of history-driven preference modeling for personalized video experiences. By moving beyond simple queries or generic summaries, HIPPO-VIDEO and HiPHer pave the way for more intelligent and user-adaptive video content delivery systems. For more technical details, you can refer to the full research paper: HIPPO-VIDEO: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting.


