TLDR: Researchers from UCLA developed a lightweight method to improve text legibility in Text-To-Video (T2V) models. They created a synthetic dataset by generating text-rich images with Text-To-Image (T2I) models and then animating them into videos using text-agnostic Image-To-Video (I2V) models. Fine-tuning the Wan2.1 T2V model with this data significantly improved short-text legibility and temporal consistency, and showed structural understanding for longer text, offering a practical solution to a common T2V challenge.
Generating videos from text descriptions has seen incredible advancements, but one persistent challenge remains: making sure any text within those videos is clear, readable, and consistent. Imagine a video where a sign says “Hello World,” but the letters are smudged, distorted, or change from frame to frame. This is a common issue with current Text-To-Video (T2V) models.
Researchers Ziyang Liu, Kevin Valencia, and Justin Cui from the University of California, Los Angeles, have tackled this problem with a novel, lightweight approach. Their work, detailed in their paper “Video Text Preservation with Synthetic Text-Rich Videos”, introduces a method to significantly improve how T2V models render text without requiring expensive architectural changes or extensive retraining.
The Challenge of Text in Videos
While Text-To-Image (T2I) models have made progress in generating legible text, these solutions are often too computationally demanding to adapt directly to video generation. Video models need to maintain consistency not just in a single frame, but across an entire sequence, which adds a layer of complexity. Current T2V models, despite their ability to create realistic motion and temporal coherence, frequently fail to produce accurate or readable text.
A Synthetic Solution
The core of the UCLA team’s method lies in using synthetic data for supervision. They developed a two-stage process to create a specialized dataset:
- First, they used Text-To-Image (T2I) diffusion models (like Stable Diffusion) to generate high-quality images that already contained clear, legible text. These images could feature anything from flags to product labels with specific words.
- Next, these text-rich images were fed into a “text-free” Image-To-Video (I2V) model. Crucially, the prompts given to the I2V model did not mention text. This prevented the I2V model from trying to generate its own text, which often leads to artifacts, and instead focused it on animating the existing image content smoothly.
This clever approach allowed them to create a dataset of video-prompt pairs where the videos contained coherent text, providing ideal training material.
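To make the pipeline concrete, here is a minimal sketch of what those two stages could look like in code, using Stable Diffusion XL as the T2I model and Stable Video Diffusion as the text-agnostic I2V model. These specific checkpoints, resolutions, and the caption wording are illustrative stand-ins; the paper does not prescribe them.

```python
# Sketch of the two-stage synthetic data pipeline described above.
# Stand-ins: SDXL for the T2I stage, Stable Video Diffusion (an
# image-conditioned, prompt-free I2V model) for the animation stage.
import torch
from diffusers import StableDiffusionXLPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

device = "cuda"

# Stage 1: generate a text-rich image with a T2I model.
t2i = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)
prompt = 'A storefront with a wooden sign that reads "Hello World", photorealistic'
image = t2i(prompt=prompt, height=576, width=1024).images[0]

# Stage 2: animate that image with a text-agnostic I2V model.
# SVD is conditioned only on the image, so it is never asked to draw
# text itself -- it just has to keep the existing pixels coherent over time.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to(device)
frames = i2v(image, num_frames=25, decode_chunk_size=8).frames[0]

# Pair the resulting clip with a caption describing the scene and its text,
# yielding one (video, prompt) training example for T2V fine-tuning.
export_to_video(frames, "hello_world_sign.mp4", fps=7)
```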
Fine-Tuning for Fidelity
With this curated synthetic dataset, the researchers fine-tuned Wan2.1, a pre-trained T2V model known for its state-of-the-art identity preservation. Its strength in keeping subjects visually consistent across frames made it an excellent candidate for preserving the high-frequency details of text. Crucially, the fine-tuning left the model's architecture and loss functions untouched: only the training data changed, teaching the model to handle text more effectively.
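Conceptually, the training step is the standard one; the toy sketch below illustrates that idea, with a small stand-in denoiser, random tensors, and a simplified linear noising schedule taking the place of Wan2.1's real backbone, latent encoders, and training objective. Nothing in the loop is specific to text except the data it would be fed.

```python
# Illustrative sketch: the model, the loss, and the loop are all generic.
# The only thing particular to this work is what the batches contain
# (latent-encoded synthetic text-rich videos and their prompt embeddings).
import torch
import torch.nn as nn

class ToyVideoDenoiser(nn.Module):
    """Stand-in for the pre-trained T2V backbone (architecture untouched)."""
    def __init__(self, channels=8, text_dim=32):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_latents, text_emb):
        cond = self.text_proj(text_emb)[:, :, None, None, None]
        return self.net(noisy_latents + cond)

model = ToyVideoDenoiser()
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(100):
    # Stand-ins for encoded synthetic videos and their prompt embeddings.
    latents = torch.randn(2, 8, 16, 32, 32)   # (batch, ch, frames, h, w)
    text_emb = torch.randn(2, 32)

    # Standard denoising objective: corrupt the latents, predict the noise back.
    noise = torch.randn_like(latents)
    t = torch.rand(2, 1, 1, 1, 1)              # per-sample noise level
    noisy = (1 - t) * latents + t * noise
    pred = model(noisy, text_emb)
    loss = torch.nn.functional.mse_loss(pred, noise)

    optim.zero_grad()
    loss.backward()
    optim.step()
```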
Promising Results
The evaluation, primarily qualitative due to the lack of standardized metrics for text fidelity in videos, showed significant improvements. The fine-tuned Wan2.1 model excelled at rendering short and simple text phrases, maintaining legibility and consistency across frames. This is a stark contrast to many baseline models, including advanced ones like Sora, which often produce garbled or smudged characters.
For longer sentences, while the model didn’t always preserve full word integrity, it demonstrated a remarkable ability to lay out plausible letterforms in sequence, creating coherent “text-like” patterns. This suggests the model learned fundamental structural patterns of typography, even when exact decoding was challenging.
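What might a standardized text-fidelity metric look like? One simple possibility, sketched below, is frame-wise OCR: recognize the text in each generated frame, score it against the string the prompt asked for, and check how stable the reading stays across consecutive frames. This is a hypothetical illustration (it assumes pytesseract and an installed Tesseract binary), not a metric used in the paper.

```python
# Hypothetical frame-level legibility check for generated videos.
from difflib import SequenceMatcher
import pytesseract  # requires the Tesseract OCR binary to be installed

def text_fidelity(frames, target_text):
    """frames: list of PIL images; target_text: the text the prompt asked for."""
    readings = [pytesseract.image_to_string(f).strip().lower() for f in frames]
    target = target_text.strip().lower()

    # Per-frame legibility: similarity between the OCR output and the target.
    legibility = [SequenceMatcher(None, r, target).ratio() for r in readings]

    # Temporal consistency: similarity between consecutive frames' OCR output.
    consistency = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in zip(readings, readings[1:])
    ]
    return (sum(legibility) / len(legibility),
            sum(consistency) / max(len(consistency), 1))
```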
Looking Ahead
These findings highlight the potential of using curated synthetic data and weak supervision to enhance textual fidelity in video generation. The researchers suggest that future work should focus on developing standardized benchmarks for evaluating text quality in videos and exploring more diverse text scenarios. This lightweight pipeline offers a practical path forward for T2V models to finally master the art of legible text.


