Code2Video: AI Agents Craft Educational Videos Through Executable Code

TLDR: Code2Video is a new AI framework that generates high-quality educational videos by using executable Python code. It features three collaborative agents—Planner, Coder, and Critic—to structure content, write animation code, and refine visual layouts. Evaluated on the MMMC benchmark with a novel ‘TeachQuiz’ metric, Code2Video significantly outperforms pixel-based and direct code generation methods, demonstrating its effectiveness in knowledge transfer and producing videos comparable to human-crafted tutorials.

Creating high-quality educational videos is a complex task, demanding not only deep subject matter expertise but also precise visual structures and smooth transitions. While modern generative AI models have made strides in video synthesis, they often fall short in producing the kind of professional, instructionally effective content needed for learning. This is because educational videos require a level of explicit control over visual elements and temporal sequencing that pixel-based generation struggles to provide.

A new research paper introduces Code2Video, a novel framework that tackles this challenge by adopting a code-centric approach to educational video generation. Instead of directly synthesizing pixels, Code2Video generates executable Python code, specifically using the Manim animation library, to create videos. This method offers greater control, interpretability, and scalability, making it particularly well-suited for educational content.

How Code2Video Works: A Three-Agent System

The Code2Video framework operates through the collaboration of three specialized AI agents:

Planner: This agent is responsible for structuring the lecture content. It takes a learning topic and breaks it down into a coherent temporal flow, generating an outline and then a detailed storyboard. It also prepares corresponding visual assets, drawing from an external database to enhance factual accuracy and visual fidelity.
Coder: The Coder agent translates the structured instructions from the Planner into executable Python code. It works in parallel across different sections of the video to improve efficiency. A key feature is its ‘ScopeRefine’ debugging strategy, which intelligently fixes errors by focusing on specific lines or blocks of code, minimizing token usage and latency.
Critic: Even executable code can produce visually unsatisfactory results. The Critic agent refines the spatial layout and ensures clarity in the rendered video. It uses a unique ‘visual anchor prompt’ system, which discretizes the 2D canvas into a grid, allowing the AI to specify precise locations for elements. This transforms continuous positioning into a discrete problem, making it easier for the AI to provide actionable feedback and correct issues like overlapping elements or poor space utilization.
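The Critic's grid idea can be illustrated with a minimal sketch. It assumes a 6x6 grid laid over Manim's default frame (8 units tall, 16:9 aspect ratio, centred on the origin). The grid size and the names `anchor_to_point` and `anchor_label` are hypothetical and chosen for illustration; they are not taken from the paper's code.

```python
from typing import Tuple

# Manim's default frame: 8 units tall, 16:9 aspect ratio, centred on the origin.
FRAME_HEIGHT = 8.0
FRAME_WIDTH = FRAME_HEIGHT * 16 / 9


def anchor_to_point(row: int, col: int, rows: int = 6, cols: int = 6) -> Tuple[float, float]:
    """Map a discrete grid cell to the (x, y) centre of that cell.

    Row 0 is the top row and column 0 is the leftmost column.
    The 6x6 default is an assumption made for this sketch.
    """
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"cell ({row}, {col}) is outside a {rows}x{cols} grid")
    cell_w = FRAME_WIDTH / cols
    cell_h = FRAME_HEIGHT / rows
    x = -FRAME_WIDTH / 2 + (col + 0.5) * cell_w
    y = FRAME_HEIGHT / 2 - (row + 0.5) * cell_h
    return (x, y)


def anchor_label(row: int, col: int) -> str:
    """Human-readable cell name such as 'A1' (hypothetical naming scheme)."""
    return f"{chr(ord('A') + row)}{col + 1}"


if __name__ == "__main__":
    # A critic could say "move the formula to C4" instead of guessing coordinates.
    print(anchor_label(2, 3), anchor_to_point(2, 3))
```

With a mapping like this, feedback such as "the title and the diagram both occupy A1" becomes a statement about discrete cells, which is easier for a model to act on than free-form coordinates.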

Evaluating Educational Effectiveness: The MMMC Benchmark and TeachQuiz

To systematically evaluate Code2Video, the researchers developed a new benchmark called MMMC (Massive Multi-discipline Multimodal Coding). This benchmark comprises professionally produced, discipline-specific educational videos, primarily sourced from the popular 3Blue1Brown YouTube channel, known for its high-quality Manim tutorials. MMMC covers 13 subject areas, from calculus to neural networks, providing a diverse and challenging dataset.

Beyond traditional aesthetic scores, Code2Video introduces a novel metric called TeachQuiz. This end-to-end metric quantifies how well a video transfers knowledge. It works by first ‘unlearning’ a target concept from a Vision-Language Model (VLM) and then measuring how effectively the generated video helps the VLM ‘relearn’ that knowledge. This isolates the video’s direct contribution to knowledge acquisition, ensuring that evaluation goes beyond mere visual appeal.
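As a rough illustration of the unlearn-then-relearn idea, the sketch below scores a video as the gain in quiz accuracy between the "unlearned" model and the same model after watching the video. The simple difference and the names `accuracy` and `teachquiz_gain` are assumptions made for illustration; the paper's exact formulation may differ.

```python
from typing import Sequence


def accuracy(answers: Sequence[str], key: Sequence[str]) -> float:
    """Fraction of quiz answers that match the answer key."""
    if len(answers) != len(key) or not key:
        raise ValueError("answers and key must be non-empty and the same length")
    correct = sum(1 for a, k in zip(answers, key) if a == k)
    return correct / len(key)


def teachquiz_gain(
    unlearned_answers: Sequence[str],
    video_answers: Sequence[str],
    key: Sequence[str],
) -> float:
    """Accuracy after watching the video minus accuracy after unlearning.

    A positive value suggests the video restored knowledge the model had lost.
    """
    return accuracy(video_answers, key) - accuracy(unlearned_answers, key)


if __name__ == "__main__":
    key = ["a", "b", "c", "d"]
    unlearned = ["a", "x", "x", "x"]   # model answers after the concept is unlearned
    with_video = ["a", "b", "c", "x"]  # same model answering with the video as context
    print(teachquiz_gain(unlearned, with_video, key))
```

Subtracting the unlearned baseline is what separates the video's contribution from anything the model could still answer on its own.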

Promising Results and Future Directions

The evaluation results demonstrate the potential of Code2Video. Compared to direct code generation by large language models, the full Planner–Coder–Critic pipeline achieves a stable 40% improvement in aesthetic scores and a 46% improvement in TeachQuiz scores when using models like Claude Opus 4.1. The generated videos are comparable to professional human-made tutorials and, in some human studies, outperform them on TeachQuiz scores.

Pixel-based video generation models, such as OpenSora-v2 and Veo3, significantly underperform, struggling with text clarity, animation timing, and overall coherence—issues critical for educational content. The code-centric approach of Code2Video ensures sharper symbol layouts, consistent styles, and coherent narrative animations, which are vital for effective learning.

While human-made videos still lead in nuanced storytelling and explanatory depth, Code2Video significantly narrows the gap. The research highlights that structured visual guidance and iterative refinement are crucial for producing clear videos that effectively convey knowledge. Future work aims to broaden the scope of video generation and develop more lightweight, scalable agent frameworks. More details are available in the paper at https://arxiv.org/pdf/2510.01174.
