TLDR: Next Visual Granularity (NVG) Generation is a new image generation framework that breaks down images into a structured sequence of visual granularities, from global layout to fine details. Unlike traditional methods, NVG iteratively refines images by generating both a structure map (layout) and corresponding content tokens, offering explicit control over the generation process. It achieves high-fidelity results and outperforms existing models like VAR, demonstrating a scalable and controllable approach to image synthesis.
In the rapidly evolving field of artificial intelligence, image generation models have made incredible strides, creating stunningly realistic and diverse visuals. However, many existing approaches treat images as flat, unstructured data or as simple sequences of pixels, which limits their ability to understand and control the intricate spatial relationships within an image. This makes fine-grained control over the generation process difficult and can cause errors to accumulate across generation steps.
A new research paper introduces a novel framework called Next Visual Granularity (NVG) Generation, which reimagines image synthesis by breaking down images into a structured sequence of visual granularities. Imagine an artist sketching a painting: they start with broad strokes for the overall layout, then add larger shapes, followed by smaller details, and finally, the fine textures. The NVG framework mimics this intuitive, coarse-to-fine progression, allowing for more natural and controllable image creation.
Understanding Visual Granularity
At the heart of NVG is the concept of a “visual granularity sequence.” Instead of viewing an image as a single entity, NVG decomposes it into multiple stages. Each stage represents the image at a different level of detail, using a varying number of unique visual “tokens” or building blocks. For instance, an early stage might use just a few tokens to capture the global layout, like separating the foreground from the background. Later stages would use many more tokens to define intricate details, such as the texture of fur or the individual leaves on a tree.
Crucially, each stage also includes a “structure map.” Think of this as a blueprint that dictates how these unique tokens are arranged across the image space. This map explicitly captures the image’s structure at different granularity levels, providing a clear guide for the generation process.
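To make the idea concrete, here is a toy sketch of what a granularity stage could look like as data. This is an illustrative simplification, not the paper's actual decomposition algorithm: it reduces a grid of VQ token ids to a small codebook of the most frequent tokens, where the "structure map" records which reduced token sits at each spatial position.

```python
import numpy as np

def granularity_stage(token_map: np.ndarray, num_unique: int):
    """Toy decomposition: keep only the `num_unique` most frequent tokens.

    Returns a structure map (indices into the reduced codebook) and the
    reduced token values. Rare tokens fall back to slot 0. This is a
    stand-in for NVG's learned coarse-to-fine decomposition.
    """
    values, counts = np.unique(token_map, return_counts=True)
    keep = values[np.argsort(counts)[::-1][:num_unique]]  # most frequent first
    lut = {v: i for i, v in enumerate(keep)}
    structure = np.vectorize(lambda t: lut.get(t, 0))(token_map)
    return structure, keep

# A 4x4 grid of hypothetical VQ token ids standing in for a tiny image.
tokens = np.array([[3, 3, 3, 7],
                   [3, 3, 7, 7],
                   [5, 5, 7, 7],
                   [5, 5, 5, 9]])

# Coarse stage: 2 unique tokens -> roughly foreground vs. background.
coarse_structure, coarse_codes = granularity_stage(tokens, 2)
# Fine stage: 4 unique tokens -> all detail retained.
fine_structure, fine_codes = granularity_stage(tokens, 4)
```

At the coarse stage the structure map partitions the grid into just two regions; at the fine stage, indexing the codebook with the structure map (`fine_codes[fine_structure]`) reproduces the original token grid exactly.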
How NVG Works: A Step-by-Step Creation
The NVG framework operates through an iterative process, progressively refining an image from a blank canvas. It involves two main components working in tandem:
- Structure Generator: This component first creates the structure map for a given stage. It determines the overall layout and how different visual elements are grouped. For example, in early stages, it might define where the main object is located relative to the background.
- Content Generator: Once the structure map is in place, the content generator fills in the visual details. It predicts the unique tokens that correspond to the defined structure, gradually adding more visual information to the image.
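The two-component loop can be sketched as follows. The random "generators" below are stand-ins for NVG's learned models (the real ones are neural networks conditioned on previous stages and the class label); the point is the control flow: at each stage, predict a structure map first, then fill its slots with content tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def structure_generator(prev_tokens, num_unique):
    """Stand-in for the learned structure model: assigns each spatial
    position to one of the stage's `num_unique` token slots."""
    h, w = prev_tokens.shape
    return rng.integers(0, num_unique, size=(h, w))

def content_generator(structure, num_unique, codebook_size):
    """Stand-in for the learned content model: picks a concrete codebook
    token id for each slot, then broadcasts it over the structure map."""
    slot_tokens = rng.integers(0, codebook_size, size=num_unique)
    return slot_tokens[structure]

def generate(stages=(2, 8, 32), size=(16, 16), codebook_size=1024):
    tokens = np.zeros(size, dtype=int)        # "blank canvas"
    for num_unique in stages:                 # coarse -> fine
        structure = structure_generator(tokens, num_unique)
        tokens = content_generator(structure, num_unique, codebook_size)
    return tokens  # final token map; a VQ decoder would render the image

image_tokens = generate()
```

Note how the stage schedule `(2, 8, 32)` (an arbitrary example) enforces the coarse-to-fine progression: the final token map can contain at most 32 distinct tokens, however large the codebook is.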
This structured, iterative approach offers several key advantages. Unlike some models that generate an image all at once or in a rigid sequence, NVG allows for explicit control over different levels of detail. If you want to change the overall composition, you can adjust an early-stage structure map. If you want to refine a specific texture, you can focus on a later stage. This built-in control means less need for additional, post-hoc modules to guide generation.
Performance and Capabilities
The researchers trained NVG models on the ImageNet dataset for class-conditional image generation, meaning the models generate images belonging to a specified category such as "dog" or "cat." The results were impressive: NVG consistently outperformed existing state-of-the-art models, particularly the VAR series, on image quality and diversity metrics such as FID. The framework also demonstrated strong scalability, with performance improving as model size increased.
Qualitative visualizations showcased NVG’s ability to produce diverse and high-fidelity images. The generated images aligned remarkably well with their corresponding binary structure maps, confirming that the model effectively uses the structural guidance. One particularly exciting capability is “structure-guided generation,” where users can provide simple geometric shapes or even structures from reference images to guide the creation of entirely new images with varied content. This means you could, for example, take the structural layout of a wallaby and generate a heron following that same pose.
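Structure-guided generation can be illustrated with the same kind of toy stand-ins. Here a user supplies an early-stage structure map (a binary disc, a hypothetical example meaning "foreground object here, background elsewhere"), and a stand-in content model fills the slots; in NVG itself, the later stages would then refine this into a detailed image with that layout.

```python
import numpy as np

rng = np.random.default_rng(1)

def structure_from_shape(size=16):
    """User-supplied early-stage structure map: a binary disc standing in
    for a simple geometric guide (or a structure lifted from a reference
    image)."""
    yy, xx = np.mgrid[:size, :size]
    radius = size // 3
    center = size // 2
    return ((yy - center) ** 2 + (xx - center) ** 2 < radius ** 2).astype(int)

def fill_content(structure, codebook_size=1024):
    """Stand-in content model: sample one token id per structure slot and
    broadcast it over the map, so layout is fixed but content is free."""
    slot_tokens = rng.integers(0, codebook_size, size=structure.max() + 1)
    return slot_tokens[structure]

guide = structure_from_shape()
tokens = fill_content(guide)  # same spatial layout, freely sampled content
```

Because the structure map is fixed while the content tokens are sampled, rerunning `fill_content` yields different "subjects" sharing the same layout, mirroring the wallaby-to-heron example above.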
The paper also highlights NVG’s strong error-correction ability. Even if early stages of generation are fixed to resemble one object (e.g., a dog), the model can still interpret the structure and content to generate an image of a completely different class (e.g., a Siamese cat or an Indian elephant), demonstrating its flexibility and robustness.
The Future of Structured Image Generation
The Next Visual Granularity Generation framework represents a significant step forward in controllable image synthesis. By explicitly modeling hierarchical visual structure, it addresses a fundamental limitation of many current approaches. This work opens up exciting possibilities for future research, including region-aware generation (where specific parts of an image can be controlled), physical-aware video generation (tracking structured regions over time for more realistic videos), and hierarchical spatial reasoning. The code and models for this research will be released, paving the way for further exploration and application of this promising technology. You can find more details on the project page: Next Visual Granularity Generation Project Page.