TLDR: MedGen is a new medical video generation model trained on MedVideoCap-55K, the first large-scale, caption-rich dataset of over 55,000 diverse medical video clips. It addresses existing models' inability to produce medically accurate content, which stems from a lack of specialized data. MedGen outperforms open-source models and rivals commercial systems in visual quality and medical accuracy, showing potential for applications such as data augmentation and medical simulation.
Generating realistic and accurate medical videos has long been a significant challenge in the field of artificial intelligence. While general video generation models have made impressive strides, they often fall short when it comes to the precise and sensitive nature of medical content, frequently producing unrealistic or erroneous visuals. This gap is primarily due to a severe lack of large-scale, high-quality datasets specifically designed for medical video generation.
Addressing this critical need, researchers have introduced MedVideoCap-55K, the first extensive and diverse dataset rich with captions for medical video generation. This groundbreaking dataset comprises over 55,000 carefully selected video clips that cover a wide array of real-world medical scenarios. These scenarios include clinical practice, medical imaging, educational content, medical animation, and even science popularization. Each clip in MedVideoCap-55K is paired with detailed, high-quality textual descriptions, making it an ideal foundation for training advanced medical video generation models.
Building on this dataset, the researchers also developed MedGen, a specialized medical video generation model. MedGen outperforms other open-source models and competes with commercial systems across various benchmarks. Its strengths lie in both visual quality and, crucially, medical accuracy. In practice, this means MedGen can generate videos that not only look good but also remain medically plausible, avoiding the anatomical inconsistencies or implausible scenarios often seen in videos from general-purpose models like Sora, Pika, or Hailuo.
Creating MedVideoCap-55K was a multi-step process. Starting from a collection of over 25 million public YouTube videos, the researchers used a two-stage filtering pipeline to identify medically relevant content: a medical keyword dictionary followed by a text classifier. Further refinement included per-frame medical classification and temporal consistency checks, so that only coherent, medically focused segments were retained. Captions were then generated for each clip using multimodal large language models (MLLMs), providing both a brief and a detailed description.
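The filtering logic described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword set, the function names `passes_keyword_filter` and `coherent_segments`, and the `min_len` parameter are all hypothetical, and the per-frame medical classifier is assumed to have already produced one boolean per frame.

```python
from typing import List, Set, Tuple

# Hypothetical keyword list; the paper's actual dictionary is not reproduced here.
MEDICAL_KEYWORDS: Set[str] = {"surgery", "ultrasound", "endoscopy", "anatomy", "mri"}


def passes_keyword_filter(title: str, keywords: Set[str] = MEDICAL_KEYWORDS) -> bool:
    """Keep a video if any word in its title matches the keyword dictionary."""
    tokens = {tok.strip(".,:;!?()").lower() for tok in title.split()}
    return bool(tokens & keywords)


def coherent_segments(frame_is_medical: List[bool], min_len: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges where at least `min_len` consecutive
    frames were classified as medical (a simple temporal consistency check)."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(frame_is_medical):
        if flag and start is None:
            start = i  # a medical run begins
        elif not flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # the run ended
    # Close a run that extends to the final frame.
    if start is not None and len(frame_is_medical) - start >= min_len:
        segments.append((start, len(frame_is_medical)))
    return segments


title_ok = passes_keyword_filter("Laparoscopic surgery, step by step")
title_bad = passes_keyword_filter("Cooking pasta at home")
segments = coherent_segments([True, True, False, True, True, True, False, True], min_len=3)
```

Here only the run of three consecutive medical frames survives; shorter runs are discarded as incoherent. A real pipeline would apply the text classifier between these two steps and operate on timestamps rather than raw frame indices.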
To further improve quality, the dataset then went through an additional filtering pass. This step addressed subtler problems such as black borders, heavy subtitles, visual clutter, and technical artifacts that could hinder model learning. Filters for black-border removal, subtitle detection via OCR, aesthetic quality assessment, and technical quality based on DOVER scores were applied, resulting in a cleaner dataset.
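One way to picture how these quality filters combine is as a set of per-clip thresholds. The sketch below assumes each clip has already been scored by upstream tools (border detection, OCR, an aesthetic model, and a technical quality model). The field names and every threshold value are illustrative assumptions, not figures from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClipQuality:
    border_fraction: float    # share of the frame area covered by black borders
    subtitle_fraction: float  # share of the frame area covered by OCR-detected text
    aesthetic_score: float    # output of an aesthetic quality model
    technical_score: float    # output of a technical quality model


@dataclass
class Thresholds:
    # All values below are hypothetical placeholders.
    max_border: float = 0.10
    max_subtitle: float = 0.05
    min_aesthetic: float = 4.5
    min_technical: float = 0.5


def rejection_reasons(q: ClipQuality, t: Optional[Thresholds] = None) -> List[str]:
    """Return the names of the filters a clip fails; an empty list means keep it."""
    t = t or Thresholds()
    reasons: List[str] = []
    if q.border_fraction > t.max_border:
        reasons.append("black_border")
    if q.subtitle_fraction > t.max_subtitle:
        reasons.append("subtitles")
    if q.aesthetic_score < t.min_aesthetic:
        reasons.append("aesthetic")
    if q.technical_score < t.min_technical:
        reasons.append("technical")
    return reasons


good = rejection_reasons(ClipQuality(0.0, 0.01, 5.2, 0.8))
bad = rejection_reasons(ClipQuality(0.3, 0.2, 5.2, 0.8))
```

Returning the list of failed filters, rather than a single boolean, makes it easier to audit how many clips each filter removes.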
MedGen’s capabilities extend beyond just generating high-quality medical videos. The research highlights its utility in practical applications, such as data augmentation for downstream medical video analysis tasks. By using MedGen-generated videos to expand training data, significant performance gains were observed in medical video classification benchmarks. This indicates that MedGen can serve as a valuable source of high-quality, domain-relevant data, especially in situations where real video data is scarce, sensitive, or expensive to acquire.
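As a rough illustration of this kind of augmentation, the sketch below mixes generated samples into a real training set while capping how much synthetic data is added. The function name, the `synthetic_ratio` parameter, and the file paths are hypothetical; the article does not describe the mixing strategy actually used in the experiments.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (video path, class label)


def augment_training_set(
    real: List[Sample],
    synthetic: List[Sample],
    synthetic_ratio: float,
    seed: int = 0,
) -> List[Sample]:
    """Add up to `synthetic_ratio * len(real)` synthetic samples, then shuffle."""
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_extra = min(len(synthetic), int(synthetic_ratio * len(real)))
    extra = rng.sample(synthetic, n_extra)
    combined = list(real) + extra
    rng.shuffle(combined)
    return combined


real_clips = [(f"real_{i}.mp4", i % 2) for i in range(4)]
generated_clips = [(f"gen_{i}.mp4", i % 2) for i in range(10)]
train_set = augment_training_set(real_clips, generated_clips, synthetic_ratio=0.5)
```

Capping the synthetic share is a design choice made for this sketch: it keeps real data dominant, which is a common precaution when the generated distribution may differ from the real one.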
Furthermore, MedGen holds strong potential for applications in science popularization and user simulation. It can generate diverse videos for surgical training, patient education, and remote consultation, offering a versatile tool for various medical and healthcare needs. The development of MedGen and MedVideoCap-55K represents a significant step forward in making medical video generation more accessible, accurate, and useful for a wide range of applications. For more in-depth information, you can refer to the original research paper.


