
Advancing Image Generation with Vision Foundation Models as Efficient Visual Tokenizers

TLDR: VFMTok is a new image tokenizer that uses frozen vision foundation models (VFMs) to create compact, semantically rich visual tokens. It employs region-adaptive quantization and a semantic reconstruction objective, leading to superior image reconstruction and generation quality, faster autoregressive model convergence (3x speedup), and high-fidelity class-conditional synthesis without classifier-free guidance (4x inference speedup).

A new research paper titled “Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation” introduces an innovative approach to image generation. Authored by Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi from institutions including The University of Hong Kong, StepFun, Dexmal, and MEGVII Technology, this work explores how powerful pre-trained vision foundation models (VFMs) can be directly used to create highly efficient image tokenizers.

Traditionally, autoregressive (AR) image generation models, much as GPT generates language token by token, rely on visual tokenizers to convert images into compact, discrete “tokens.” However, existing tokenizers, often trained from scratch, tend to produce latent spaces that are rich in low-level detail but lack high-level semantic understanding and carry substantial redundancy. This inefficiency slows AR model training and often requires additional techniques such as classifier-free guidance (CFG) to achieve high-quality, class-specific image generation, which further increases inference time.
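For readers unfamiliar with how such tokenizers discretize images, the sketch below shows the core quantization step in a VQGAN-style tokenizer: each continuous feature vector is snapped to its nearest entry in a learned codebook, and that entry’s index becomes the discrete token. This is a minimal illustration of the general technique, not the paper’s implementation.

```python
# Minimal sketch of VQGAN-style discrete tokenization (illustrative only).
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Map continuous features to discrete token ids.

    features: (N, D) continuous encoder outputs
    codebook: (K, D) learned embedding table
    returns:  (N,) integer token ids and (N, D) quantized vectors
    """
    # Squared Euclidean distance from every feature to every codebook entry.
    dists = torch.cdist(features, codebook)   # (N, K)
    ids = dists.argmin(dim=1)                 # nearest codebook entry per feature
    return ids, codebook[ids]

# Example: a 16x16 grid of 8-dim features becomes 256 discrete tokens.
feats = torch.randn(256, 8)
book = torch.randn(1024, 8)                   # vocabulary of 1024 codes
token_ids, quantized = quantize(feats, book)
```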

The researchers posed a compelling question: Can the rich, semantic features learned by vision foundation models like DINOv2 and CLIP, originally designed for visual understanding, also serve as robust representations for image reconstruction and generation? Their pilot studies showed promising results, indicating that features from frozen pre-trained VFMs could indeed support effective image reconstruction and even surpass the generative performance of fully trained VQGAN encoders.
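As a rough illustration of that pilot-study setup, the snippet below freezes a pretrained DINOv2 backbone so that only components built on top of its features would be trained. The timm model name and input resolution here are assumptions for the example, not details from the paper.

```python
# Sketch of the pilot-study idea: keep a pretrained VFM frozen and extract
# features for downstream reconstruction. Model name/resolution are assumptions.
import timm
import torch

vfm = timm.create_model("vit_base_patch14_dinov2", pretrained=True)
vfm.eval()
for p in vfm.parameters():
    p.requires_grad_(False)               # the encoder stays frozen throughout

with torch.no_grad():
    # DINOv2 ViT-B/14 expects 518x518 inputs in this timm configuration.
    feats = vfm.forward_features(torch.randn(1, 3, 518, 518))
```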

Introducing VFMTok: A Novel Image Tokenizer

Building on these insights, the team developed VFMTok, an image tokenizer that leverages a frozen pre-trained VFM for adaptive, region-level tokenization. VFMTok is designed to achieve high reconstruction and generation quality while significantly improving token efficiency. It works by using a frozen VFM as an encoder to extract multi-level semantic features from an image. Instead of using a rigid 2D grid, VFMTok employs a region-adaptive sampling mechanism with learnable “anchor queries” and deformable attention. This allows it to identify and aggregate features from semantically coherent, irregular regions, effectively reducing redundancy and creating more compact, meaningful tokens.
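The exact anchor-query and deformable-attention design has not been released, but the sketch below conveys the idea under stated assumptions: learnable queries predict irregular sampling locations over the frozen VFM’s feature map, and the sampled points are pooled into one compact token per region. Mean pooling stands in here for the paper’s deformable attention weighting; all module names and shapes are illustrative.

```python
# Hedged sketch of region-adaptive token sampling with learnable anchor queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAdaptiveSampler(nn.Module):
    def __init__(self, dim: int, num_tokens: int = 256, points_per_token: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))  # anchor queries
        self.offset_head = nn.Linear(dim, points_per_token * 2)    # 2D sample offsets
        self.points = points_per_token

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        """feat_map: (B, D, H, W) frozen VFM features -> (B, num_tokens, D)."""
        B, D, H, W = feat_map.shape
        # Each anchor query predicts where to look, in normalized [-1, 1] coords.
        offsets = torch.tanh(self.offset_head(self.queries))       # (T, P*2)
        grid = offsets.view(1, -1, self.points, 2).expand(B, -1, -1, -1)
        # Bilinearly sample irregular, query-specific locations from the map.
        sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, D, T, P)
        # Aggregate sampled points per token (mean pooling as a stand-in for
        # the paper's deformable attention aggregation).
        return sampled.mean(dim=-1).transpose(1, 2)                 # (B, T, D)
```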

To ensure the quality and semantic fidelity of these tokens, VFMTok incorporates two key components. First, a region-adaptive quantization framework reduces redundancy in the pre-trained features. Second, a semantic reconstruction objective aligns the tokenizer’s outputs with the foundation model’s representations, preserving the rich semantic content. The model also uses a shared lightweight Vision Transformer (ViT) for both reconstructing the original image pixels and reconstructing the VFM’s high-level features, which helps maintain semantic integrity and reduces model parameters.
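A minimal sketch of such a dual objective follows, assuming an L1 pixel term and a cosine feature-alignment term; the paper’s exact losses and weighting may differ.

```python
# Sketch of a dual reconstruction objective: one decoder is trained to match
# both the original pixels and the frozen VFM's features. Loss choices are
# assumptions for illustration.
import torch.nn.functional as F

def tokenizer_loss(pixels_pred, pixels_gt, vfm_pred, vfm_gt, w_sem: float = 1.0):
    # Standard pixel-space reconstruction term.
    pixel_loss = F.l1_loss(pixels_pred, pixels_gt)
    # Semantic reconstruction term: align decoded features with the frozen
    # foundation model's features (no gradient flows into the VFM targets).
    sem_loss = 1 - F.cosine_similarity(vfm_pred, vfm_gt.detach(), dim=-1).mean()
    return pixel_loss + w_sem * sem_loss
```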


Remarkable Results and Efficiency Gains

The experimental results for VFMTok are impressive. It achieves superior image reconstruction and generation performance while using significantly fewer tokens. For instance, VFMTok uses only 256 tokens to represent an image, compared to 576 tokens used by some prior methods like LlamaGen’s VQGAN variant. Despite this reduction, VFMTok achieved a strong rFID (reconstruction Fréchet Inception Distance) of 0.89 and an rIS (reconstruction Inception Score) of 215.4, outperforming other tokenizers and demonstrating better semantic consistency in reconstructions.

For class-conditional image generation, VFMTok-based models showed competitive performance against mainstream models, including diffusion models. Notably, VFMTok-XXL, with 1.4 billion parameters, achieved a gFID (generation Fréchet Inception Distance) of 1.95 without classifier-free guidance (CFG), outperforming LlamaGen-3B (3.1 billion parameters) which achieved 9.38 gFID without CFG. This highlights VFMTok’s ability to enable high-fidelity, CFG-free image synthesis, which significantly accelerates inference time.
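The inference saving from dropping CFG is easy to see in code: guided AR decoding needs two forward passes per token (conditional and unconditional), while CFG-free decoding needs only one. The `model` interface below is hypothetical.

```python
# Why CFG-free decoding is cheaper: classifier-free guidance doubles the
# forward passes per generated token. `model`, `class_id`, and `null_id`
# are hypothetical placeholders.
from typing import Optional
import torch

def next_token_logits(model, tokens, class_id, null_id,
                      cfg_scale: Optional[float] = None) -> torch.Tensor:
    cond = model(tokens, class_id)               # conditional logits
    if cfg_scale is None:
        return cond                              # CFG-free: single pass per token
    uncond = model(tokens, null_id)              # second, unconditional pass
    return uncond + cfg_scale * (cond - uncond)  # guided logits, ~2x compute
```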

Furthermore, VFMTok accelerates AR model training convergence by three times compared to VQGAN. The reduced number of tokens also leads to a four-fold generation speedup over counterparts like DINOv2-VQGAN and CLIP-VQGAN. This efficiency, combined with high-quality output, makes VFMTok a promising advancement in autoregressive image generation.

The researchers plan to release the code publicly to benefit the community. For more details, see the full research paper, “Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation.”
