TLDR: A new AI system uses natural language prompts to autonomously edit long, story-driven videos. It creates a detailed semantic index of the video content, then uses specialized AI agents to plan, retrieve, and render edits based on user prompts. Evaluations show it significantly improves usability, comprehension, and task success for long-form video editing compared to existing AI tools, offering a hands-free approach to complex video production.
Video editing, especially for long, story-rich content, has always been a demanding task. Creators face a heavy cognitive load in sifting through hours of footage, storyboarding, and sequencing. Traditional editing software offers precise timeline controls, and recent large multimodal models (LMMs) can summarize footage, but both fall short of automating the high-level, story-centric work required to shape a narrative.
A new research paper introduces a novel solution: a prompt-driven, agentic video editing system designed to autonomously understand and restructure multi-hour, story-driven media. This system aims to reduce the cognitive load on creators by allowing them to direct complex edits using free-form natural language prompts, rather than intricate timeline manipulations.
At its core, the system builds a persistent semantic index of the video content. This index is created through a sophisticated pipeline involving hierarchical temporal segmentation, guided memory compression, and cross-granularity fusion. This process generates detailed, interpretable traces of the plot, dialogue, emotions, and context within the video, all precisely aligned with timestamps.
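To make this concrete, here is a minimal sketch of what such a timestamp-aligned index might look like. The field names and nesting are illustrative assumptions based on the description above, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SceneEntry:
    """One timestamped trace inside a scene (illustrative fields only)."""
    start_s: float
    end_s: float
    plot: str                # what happens in this span
    dialogue: list[str]      # attributed lines, e.g. "ANNA: I can't stay."
    emotion: str             # dominant emotional signal
    context: str             # setting / cinematographic notes

@dataclass
class SemanticIndex:
    """Persistent, hierarchical index over a multi-hour video."""
    synopsis: str                      # evolving global plot summary
    characters: dict[str, str]         # character name -> role/arc notes
    scenes: list[SceneEntry] = field(default_factory=list)

    def between(self, t0: float, t1: float) -> list[SceneEntry]:
        """Return every entry overlapping the window [t0, t1] in seconds."""
        return [e for e in self.scenes if e.start_s < t1 and e.end_s > t0]
```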
The interaction with this system is entirely natural language-based. Users can issue broad commands, such as “Summarize this lecture as a 3-minute explainer,” or highly specific instructions like “Exclude scenes with Character A, use somber background music, emphasize emotional transitions.” The system then generates a coherent plan, intermediate artifacts like storyboards and narration scripts, and ultimately, a polished video edit, all without human intervention.
The architecture is modular, employing specialized agents for planning, retrieval, and rendering. These agents operate over the semantic index, allowing for parallel exploration of different editing versions and styles. This modularity ensures scalability, observability, and configurability in the editing workflow.
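A rough sketch of how such specialized agents could be composed over the shared index follows. The agent interface and the sequential planning-retrieval-rendering order are assumptions inferred from the description above, not the system's actual code.

```python
from typing import Protocol

class Agent(Protocol):
    """Minimal agent contract: consume the shared index plus inputs, return artifacts."""
    def run(self, index: "SemanticIndex", **inputs) -> dict: ...

def edit_video(index: "SemanticIndex", prompt: str,
               planner: Agent, retriever: Agent, renderer: Agent) -> dict:
    """Sequential orchestration of planning -> retrieval -> rendering.

    Each stage's output is an observable artifact, and stages could be swapped
    or run in parallel to explore alternative versions and styles.
    """
    plan = planner.run(index, prompt=prompt)                      # storyboard + narration
    clips = retriever.run(index, storyboard=plan["storyboard"])   # matched clips
    return renderer.run(index, plan=plan, clips=clips)            # final edit artifacts
```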
The system’s effectiveness was rigorously tested on over 400 long-form videos, including narrative films, conference keynotes, interviews, and sports broadcasts. Evaluations involved expert ratings, structured question answering, and human preference studies. The results showed significant improvements in semantic fidelity, narrative coherence, and overall usability compared to existing methods.
How the System Works: A Three-Stage Pipeline
The pipeline consists of three main stages: video comprehension and semantic indexing, video-centric question answering, and prompt-driven video response generation. All subsequent modules rely on the structured, temporally grounded, and semantically enriched text index created in the first stage.
1. Video Comprehension and Semantic Indexing
This phase transforms multimodal video input into a compact, interpretable textual representation. It begins with temporal segmentation, breaking videos into overlapping 15-minute segments for coarse-grained narrative construction and finer 5-minute scenes for detailed semantic extraction. Gemini 2.0 Flash is used for initial comprehension, extracting high-level information like media format, setting, premise, and key characters. A crucial aspect is guided context compression, where each segment’s summary is distilled to manage token limitations and track evolving plotlines and character arcs efficiently.
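A small helper illustrating the overlapping windowing described here: 15-minute segments with some overlap, plus finer 5-minute scenes. The 2-minute overlap is an assumed value, since the exact stride is not given in this summary.

```python
def windows(duration_s: float, size_s: float, overlap_s: float = 0.0):
    """Yield (start, end) windows of `size_s` seconds with `overlap_s` seconds of overlap."""
    step = size_s - overlap_s
    start = 0.0
    while start < duration_s:
        yield (start, min(start + size_s, duration_s))
        start += step

# Coarse segments for narrative construction (overlap amount is an assumption)
segments = list(windows(duration_s=2 * 3600, size_s=15 * 60, overlap_s=2 * 60))
# Finer scenes for detailed semantic extraction
scenes = list(windows(duration_s=2 * 3600, size_s=5 * 60))
```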
Following this, fine-grained scene comprehension processes 5-minute scenes, guided by the draft synopsis and character graph. This helps in accurate speaker attribution, character continuity, and motivation inference. It extracts dialogue, cinematographic descriptors, and emotional signals, assigning timestamps at regular intervals or semantic boundaries. A final refinement pass reconciles local and global outputs, correcting inconsistencies and enriching descriptions to create a comprehensive, self-auditing index.
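The refinement pass itself is LLM-driven, but one simple, mechanical consistency check it could include is flagging scene-level speaker attributions that never appear in the global character graph. The sketch below is a hypothetical illustration of that idea, reusing the illustrative SemanticIndex from the earlier sketch.

```python
def flag_unknown_speakers(index: SemanticIndex) -> list[tuple[float, str]]:
    """Return (timestamp, speaker) pairs whose speaker is missing from the
    character graph, so a refinement pass can reconcile or re-attribute them."""
    known = {name.lower() for name in index.characters}
    issues = []
    for entry in index.scenes:
        for line in entry.dialogue:
            speaker = line.split(":", 1)[0].strip().lower()
            if speaker and speaker not in known:
                issues.append((entry.start_s, speaker))
    return issues
```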
2. Video-Centric Question Answering
For question answering, the system uses the structured index, formatting it into a memory-efficient prompt. An auxiliary agent determines if visual evidence is needed, retrieving relevant clips using timestamp filters and semantic similarity. This allows for precise answers to queries like “When did the protagonist express doubt?”
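A hedged sketch of that retrieval step, combining a timestamp filter with cosine similarity over embedded index entries. The embedding vectors, window bounds, and top-k cutoff are placeholders; the real system's retrieval logic may differ.

```python
import numpy as np

def retrieve(question_vec: np.ndarray, entry_vecs: np.ndarray, entries: list,
             t0: float | None = None, t1: float | None = None, top_k: int = 3) -> list:
    """Rank index entries by cosine similarity to the question, optionally
    restricted to a timestamp window (e.g. "in the first hour")."""
    candidates = [
        (i, e) for i, e in enumerate(entries)
        if (t0 is None or e.end_s > t0) and (t1 is None or e.start_s < t1)
    ]

    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    ranked = sorted(candidates, key=lambda ie: cos(question_vec, entry_vecs[ie[0]]),
                    reverse=True)
    return [e for _, e in ranked[:top_k]]
```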
3. Prompt-Driven Video Response Generation
This is where the cinematic video editing happens. A high-level planning agent interprets the user’s prompt, considering tone, perspective, and scope, to generate a structured storyboard. A narration agent then converts each storyboard segment into a naturalistic voiceover script, which guides the retrieval agent in finding the most semantically relevant visual segments from the indexed video corpus. These selected clips are then compiled into a structured video editing plan, specifying time ranges and rendering modes.
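The editing plan handed from retrieval to rendering can be thought of as a list of clip specifications like the sketch below; the field names and rendering-mode values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ClipSpec:
    """One entry of the structured editing plan (illustrative fields)."""
    source: str        # path to the indexed source video
    start_s: float     # in-point, seconds
    end_s: float       # out-point, seconds
    narration: str     # voiceover text for this clip
    mode: str = "cut"  # rendering mode, e.g. "cut", "crossfade", "crop-to-subject"

plan = [
    ClipSpec("keynote.mp4", 312.0, 327.5,
             "The speaker opens with the product's origin story."),
    ClipSpec("keynote.mp4", 1840.0, 1862.0,
             "Midway through, the live demo reveals the new feature.", mode="crossfade"),
]
```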
The final rendering stage is orchestrated by a formatting agent, which combines video clips, narration audio, subtitles, and music. Tools like FFmpeg and MoviePy are used for sequencing, and ElevenLabs TTS for high-quality voiceovers. Specialized agents handle beat alignment, dynamic cropping, and micro-cut refinement to ensure a polished, professional output.
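As a minimal rendering sketch, the snippet below assembles such a plan with MoviePy's 1.x API, assuming the narration has already been synthesized to an audio file by a TTS service; subtitle burning, music ducking, beat alignment, and dynamic cropping are omitted.

```python
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips

def render(plan, narration_path: str, out_path: str = "recap.mp4") -> None:
    """Cut each planned clip from its source, concatenate them, and lay the
    pre-synthesized narration track over the result."""
    clips = [VideoFileClip(spec.source).subclip(spec.start_s, spec.end_s) for spec in plan]
    video = concatenate_videoclips(clips, method="compose")
    video = video.set_audio(AudioFileClip(narration_path))
    video.write_videofile(out_path, codec="libx264", audio_codec="aac")
```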
Evaluation Highlights
The system was evaluated across four studies. Study 1 showed that the agentic pipeline significantly outperformed Gemini 2.0 Flash and 2.5 Flash in both quality and usability of generated indexes, and achieved significantly higher usability than Gemini 2.5 Pro. Study 2, a question-answering task, confirmed the critical role of the refinement stage, with the full pipeline achieving the highest scores for correctness, timestamp accuracy, and detail.
In Study 3, which compared AI-edited movie recaps with human-edited ones, human edits were judged more professional, but the AI-generated recaps were comparably engaging in terms of genre alignment and watchability. Finally, Study 4, a user experience study, found that participants were significantly more successful at creating long-form recaps with the agentic pipeline than with Descript or OpusClip. For short-form tasks, it outperformed OpusClip and was comparable to Descript.
This research highlights the power of modular, agentic orchestration in video editing, offering a glimpse into the future of AI-assisted creative workflows. For more details, you can read the full research paper here.


