TLDR: Researchers introduced two new ControlNet modules for Flow Matching-based diffusion models: a Proportion ControlNet that uses bounding boxes to set object placement and scale, and a Perspective ControlNet that uses vanishing lines to define 3D scene geometry. Trained with fully automated data pipelines, the modules give artists higher-level control over image generation. Each works well for its respective task (especially 1- and 2-point perspectives), but both show limitations with complex constraints like 3-point perspectives and require careful guidance-strength management when used together.
Modern text-to-image diffusion models have made incredible strides in generating realistic and complex images. However, artists and creators often face a significant challenge: precisely controlling the spatial arrangement and geometric structure of the elements within these generated images. While a simple text prompt can guide the overall scene, it offers limited fine-grained control over where objects appear or how perspective is rendered.
Addressing this limitation, a new research paper introduces two specialized ControlNet modules designed to give artists more intuitive and high-level control over image generation. These modules, the Proportion ControlNet and the Perspective ControlNet, extend the capabilities of Flow Matching-based diffusion models like FLUX.1-dev, allowing for more deliberate artistic expression.
Proportion Control with Bounding Boxes
The Proportion ControlNet empowers users to dictate the placement and scale of objects using simple bounding boxes. Unlike regional prompting, which assigns different text prompts to specific masked areas, this method uses a single global prompt. The bounding boxes merely define the regions where elements described in the global prompt should appear, giving the model creative freedom to interpret and fill those spaces. This approach is also distinct from low-level controllers like Canny or LineArt, which focus on exact contours. Bounding boxes offer a higher level of abstraction, defining the semantic space an object should occupy rather than its precise shape, making it easier for artists to apply compositional rules like the rule of thirds without needing detailed outlines.
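For readers who want to experiment, here is a minimal sketch of how bounding-box conditioning might be wired up with Hugging Face diffusers' FLUX ControlNet classes. The checkpoint ID `org/proportion-controlnet`, the box coordinates, and the exact control-image format (white outlines on black) are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch: rasterize bounding boxes into a control image and run a FLUX
# ControlNet pipeline with a single global prompt. Repo ID is a placeholder.
import torch
from PIL import Image, ImageDraw
from diffusers import FluxControlNetModel, FluxControlNetPipeline

def boxes_to_control_image(boxes, size=(1024, 1024)):
    """Draw each (x0, y0, x1, y1) box as a white outline on a black canvas."""
    canvas = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(canvas)
    for x0, y0, x1, y1 in boxes:
        draw.rectangle([x0, y0, x1, y1], outline="white", width=4)
    return canvas

controlnet = FluxControlNetModel.from_pretrained(
    "org/proportion-controlnet",  # hypothetical checkpoint ID
    torch_dtype=torch.bfloat16,
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
).to("cuda")

# One global prompt; the boxes only say *where* elements should land,
# not what each box contains.
control = boxes_to_control_image([(80, 600, 420, 980), (550, 120, 950, 520)])
image = pipe(
    prompt="a lighthouse on a cliff and a sailboat at sea, oil painting",
    control_image=control,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("proportion_controlled.png")
```

Note the contrast with regional prompting: there is no per-box text, so the model remains free to decide which element of the global prompt fills which region.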
Perspective Control with Vanishing Lines
For defining the 3D geometry and viewpoint of a scene, the Perspective ControlNet utilizes vanishing lines. The researchers found that while vanishing points mathematically define perspective, they are problematic as conditioning inputs because they can be at infinity or far outside the image canvas, making them imprecise and difficult for users to manipulate. Vanishing lines, on the other hand, are intuitive for artists to draw, mimicking the natural sketching process of defining convergence. Crucially, these lines are always contained within the canvas, providing a direct, unambiguous, and spatially grounded proxy for vanishing points, thus offering a more robust and user-friendly input for perspective control.
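The appeal of lines over points is easy to see in code: even when the vanishing point sits far outside the frame, the rays drawn through it are clipped to the canvas and remain a valid conditioning image. The helper below is an illustrative sketch; the function name and rendering details are assumptions, not the paper's implementation.

```python
# Sketch: convert a (possibly off-canvas) vanishing point into on-canvas
# vanishing lines suitable as a conditioning image.
import math
from PIL import Image, ImageDraw

def vanishing_lines(vp, size=(1024, 1024), n_lines=8, width=3):
    """Draw n_lines rays through the vanishing point `vp`.

    `vp` may lie far outside the canvas; PIL clips each segment to the
    image bounds, so the conditioning input always stays within the frame.
    """
    canvas = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(canvas)
    vx, vy = vp
    reach = 4 * max(size)  # long enough to cross the whole canvas
    for i in range(n_lines):
        angle = 2 * math.pi * i / n_lines
        end = (vx + reach * math.cos(angle), vy + reach * math.sin(angle))
        draw.line([vp, end], fill="white", width=width)
    return canvas

# A vanishing point well outside the frame still yields in-frame lines.
control = vanishing_lines(vp=(2400, 512))
control.save("perspective_condition.png")
```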
Automated Data Pipelines for Training
To train these specialized ControlNets, the researchers developed fully automated data pipelines. For the Proportion ControlNet, the pipeline processed the WikiArt dataset, filtering images for aesthetic quality, then using Florence-2 for captioning and Grounding DINO for detecting object bounding boxes. The Perspective ControlNet’s pipeline processed a subset of OpenImages v7, also with aesthetic filtering, and employed a 2-Line Exhaustive Search algorithm to identify images with strong perspective structures, followed by Florence-2 for captioning. Notably, the resulting perspective dataset was heavily skewed towards 1-point perspectives.
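A rough sketch of the two annotation steps (captioning, then box detection) using off-the-shelf checkpoints is shown below; the specific models, thresholds, and filtering heuristics in the paper's pipeline may differ, and the example labels stand in for noun phrases that would be parsed from the caption.

```python
# Sketch: caption an image with Florence-2, then ground objects with
# Grounding DINO via the transformers zero-shot-object-detection pipeline.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, pipeline

image = Image.open("wikiart_sample.jpg").convert("RGB")

# 1) Caption with Florence-2 (task-prompt interface; trust_remote_code needed).
fl_proc = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True)
fl_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True)
inputs = fl_proc(text="<CAPTION>", images=image, return_tensors="pt")
ids = fl_model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
)
caption = fl_proc.batch_decode(ids, skip_special_tokens=True)[0]

# 2) Detect bounding boxes for objects mentioned in the caption.
detector = pipeline("zero-shot-object-detection",
                    model="IDEA-Research/grounding-dino-tiny")
detections = detector(image,
                      candidate_labels=["boat", "lighthouse"],  # parsed from caption in practice
                      threshold=0.3)
boxes = [d["box"] for d in detections]  # training targets for the ControlNet
print(caption, boxes)
```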
Experimental Insights and Limitations
Experiments demonstrated that the Proportion ControlNet effectively adheres to bounding box constraints and even showed an emergent ability to interpret non-rectangular shapes as proportional guides, likely due to its LineArt initialization. However, training on WikiArt introduced a “pictorial” style bias that intensified with higher ControlNet guidance strength.
The Perspective ControlNet successfully generated scenes respecting 1- and 2-point perspectives. A notable limitation was its consistent failure to render 3-point perspectives, often ignoring vertical convergence—a problem attributed to the skewed training data. The model also exhibited a strong prior for straight horizons, requiring explicit textual prompting (e.g., “Top view”) to achieve non-standard views like Dutch angles.
When attempting to use both ControlNets simultaneously, the researchers observed that optimal guidance strengths for individual modules led to image degradation, including severe color artifacts and “mushy” textures. Stable generation required reducing the guidance strength of each module to approximately 0.5, achieving partial adherence to both constraints, though robust and precise combined control proved challenging.
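Combining the two modules at the reduced ~0.5 strengths the authors found stable might look like the hedged sketch below, using diffusers' multi-ControlNet wrapper. The repo IDs are placeholders, and the conditioning images are the ones produced by the earlier sketches.

```python
# Sketch: stack both ControlNets with FluxMultiControlNetModel, keeping each
# conditioning scale near 0.5 since full strength degraded the output.
import torch
from PIL import Image
from diffusers import (FluxControlNetModel, FluxControlNetPipeline,
                       FluxMultiControlNetModel)

proportion = FluxControlNetModel.from_pretrained(
    "org/proportion-controlnet", torch_dtype=torch.bfloat16)   # hypothetical
perspective = FluxControlNetModel.from_pretrained(
    "org/perspective-controlnet", torch_dtype=torch.bfloat16)  # hypothetical

pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    controlnet=FluxMultiControlNetModel([proportion, perspective]),
    torch_dtype=torch.bfloat16,
).to("cuda")

box_image = Image.open("proportion_controlled_condition.png")   # from the box sketch
line_image = Image.open("perspective_condition.png")            # from the line sketch

image = pipe(
    prompt="a train station hall at golden hour",
    control_image=[box_image, line_image],     # one condition per module
    controlnet_conditioning_scale=[0.5, 0.5],  # reduced to avoid artifacts
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
```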
This work represents a significant step towards providing artists with more sophisticated and intuitive tools for controlling generative AI models. The researchers conclude that future work should focus on improving data diversity, potentially through synthetic generation from 3D scenes, to overcome current limitations. Both models are openly available on HuggingFace for wider access and experimentation. You can read the full research paper for more details here.


