TLDR: InstructVTON is a virtual try-on (VTO) system that uses natural language instructions to enable complex, fine-grained styling control for single or multiple garments. Its AutoMasker automatically generates minimal, targeted masks from user instructions and image segmentation, eliminating the need for manual mask drawing. An agentic architecture plans multi-garment try-on sequences and handles challenging styling scenarios, sometimes via intermediate ‘dummy garments’. InstructVTON achieves higher mask efficiency than existing auto-masking approaches while maintaining high image generation quality, offering a more intuitive and flexible experience for virtual try-on applications.
Virtual try-on (VTO) technology has emerged as a powerful tool for online shopping and content creation, allowing users to visualize how garments would look on a person without a physical fitting. These systems traditionally rely on the user providing a precise binary mask to indicate where the garment should be placed on the human model. Creating such masks is challenging: it requires technical knowledge, and even well-drawn masks often cannot express complex styling requests, such as rolled-up sleeves or multiple garments layered in a specific arrangement.
A new system called InstructVTON addresses these limitations by offering an instruction-following, interactive virtual try-on experience. It allows for fine-grained and complex styling control, guided by natural language, for single or multiple garments. This innovation simplifies the end-user experience by removing the need for manually drawn masks and automating complex multi-round image generation scenarios.
How InstructVTON Works: The Brains Behind the Try-On
InstructVTON is built on an agentic system that leverages Vision Language Models (VLMs) and image segmentation models. At its core, it has two main components: a Top-level Agent and a VTO Agent.
The Top-level Agent acts as a planner. When a user wants to try on multiple garments with a specific style instruction (e.g., “try on the shirt tucked in, jacket open”), this agent organizes the task. It determines the correct order for trying on each garment and summarizes the relevant style instruction for each step. For instance, it knows to try on a shirt before a jacket if the jacket is meant to be layered on top.
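As a rough illustration, the planning step can be thought of as a single VLM call that returns an ordered list of (garment, instruction) steps. In the sketch below, `call_vlm` and the JSON schema are placeholders, not the paper’s actual interface:

```python
import json

def plan_tryon(garment_images: list[str], style_instruction: str) -> list[dict]:
    """Ask a VLM to order the garments and split the styling request per step."""
    prompt = (
        "You are a virtual try-on planner. Given the attached garments and the "
        f"styling request '{style_instruction}', return a JSON list of steps, "
        "each with 'garment' and 'instruction', ordered so inner layers "
        "(e.g., shirts) come before outer layers (e.g., jackets)."
    )
    # call_vlm is a stand-in for any vision-language-model API.
    return json.loads(call_vlm(prompt, images=garment_images))

# For "try on the shirt tucked in, jacket open", a plan might look like:
# [{"garment": "shirt.png", "instruction": "tucked in"},
#  {"garment": "jacket.png", "instruction": "worn with buttons open"}]
```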
The VTO Agent then executes this plan step by step. For each garment, it receives the current human model image, the target garment image, and the summarized style instruction. This is where the innovative AutoMasker comes into play.
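Conceptually, the VTO Agent runs a simple feed-forward loop, shown below as a minimal sketch; `load_image`, `auto_masker`, and `vto_model` are hypothetical names standing in for the system’s actual components:

```python
def run_vto(model_image, plan):
    """Execute a try-on plan one garment at a time, feeding each result
    back in as the input for the next step."""
    current = model_image
    for step in plan:
        garment = load_image(step["garment"])  # placeholder image loader
        # AutoMasker derives a minimal inpainting mask from segmentation
        # maps plus the per-step style instruction (see the next section).
        mask = auto_masker(current, garment, step["instruction"])
        # Any inpainting-based VTO backbone that accepts an (image,
        # garment, mask) triple can act as the generator here.
        current = vto_model(image=current, garment=garment, mask=mask)
    return current
```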
AutoMasker: Smart and Efficient Mask Generation
One of the most significant challenges in inpainting-based VTO is generating an effective mask. Traditional auto-masking solutions often create masks that cover more area than strictly necessary, potentially altering parts of the original image that should be preserved. InstructVTON’s AutoMasker instead takes a minimally invasive approach, aiming for high “mask efficiency”: it masks only the region the styling change requires, preserving as many pixels of the original image as possible.
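The paper’s exact formula isn’t reproduced here, but mask efficiency can be read as the fraction of pixels the mask leaves untouched, as in this minimal sketch:

```python
import numpy as np

def mask_efficiency(mask: np.ndarray) -> float:
    """Fraction of pixels the mask leaves untouched.

    `mask` is binary: 1 marks pixels the VTO model may repaint.
    Higher values mean more of the original image is preserved.
    """
    return 1.0 - float(mask.mean())

# A 200x140 masked region in a 512x384 image covers ~14% of the pixels.
demo = np.zeros((512, 384), dtype=np.float32)
demo[100:300, 120:260] = 1.0
print(f"mask efficiency: {mask_efficiency(demo):.2f}")  # -> 0.86
```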
The AutoMasker achieves this by using two types of segmentation models: a Body Parts Segmentation Map (BPSM) model, which identifies human body parts (like torso, arms, legs), and a Clothing Segmentation Map (CSM) model, which identifies existing clothing on the person. By combining information from these maps with the target garment type and the natural language style instruction, the AutoMasker intelligently determines the precise area to mask. For example, if the instruction is to try on an overcoat, it might mask the area between the legs to create a natural-looking drape. If the instruction is “wear the jacket with buttons open,” it can remove a stripe from the center of the masking area to achieve an open-chest style, preserving the garment underneath.
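Assuming the BPSM and CSM are integer label maps, the mask composition might look roughly like the following; the label IDs and the keyword check for “open” are illustrative choices, not the paper’s actual taxonomy or instruction parser:

```python
import numpy as np

# Hypothetical label IDs for the two segmentation maps.
TORSO, LEFT_ARM, RIGHT_ARM = 1, 2, 3  # BPSM body-part labels
UPPER_CLOTHES = 5                     # CSM clothing label

def build_mask(bpsm: np.ndarray, csm: np.ndarray, instruction: str) -> np.ndarray:
    """Union the body parts the new garment will cover with the pixels of
    the existing garment, then carve out regions the instruction preserves."""
    mask = np.isin(bpsm, [TORSO, LEFT_ARM, RIGHT_ARM]) | (csm == UPPER_CLOTHES)
    if "open" in instruction:  # e.g., "wear the jacket with buttons open"
        _, w = mask.shape
        mask[:, w // 2 - w // 10 : w // 2 + w // 10] = False  # keep a center stripe
    return mask.astype(np.uint8)
```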
In cases where a style instruction cannot be achieved in a single step (e.g., trying on a long-sleeve shirt with sleeves rolled up on a person already wearing a long-sleeve shirt), the VTO Agent employs a clever two-step approach. It might first use a “dummy garment” (like a tank top) to generate an intermediate image where the arms are exposed, and then apply the original target garment with the “sleeves rolled up” instruction to this intermediate image.
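In code, this fallback reduces to two chained calls to the single-garment routine; `vto_step` and `DUMMY_TANK_TOP` are placeholder names for illustration:

```python
def tryon_with_rolled_sleeves(model_image, shirt_image):
    # Step 1: a dummy garment (e.g., a tank top) yields an intermediate
    # image with exposed arms, removing the original long sleeves.
    intermediate = vto_step(model_image, garment=DUMMY_TANK_TOP,
                            instruction="")
    # Step 2: the target garment is applied to the intermediate image,
    # where "sleeves rolled up" is now achievable in one inpainting pass.
    return vto_step(intermediate, garment=shirt_image,
                    instruction="sleeves rolled up")
```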
Performance and Interoperability
InstructVTON has been shown to be interoperable with existing state-of-the-art VTO models without requiring retraining or fine-tuning. Experiments demonstrate that it consistently achieves higher mask efficiency compared to other leading models, meaning it preserves more of the original human model image while delivering comparable or improved image generation quality. This is measured using metrics like Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS).
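Both metrics are available off the shelf for anyone who wants to run similar comparisons; the snippet below uses scikit-image for SSIM and the `lpips` package for LPIPS, which is standard usage rather than the paper’s evaluation code:

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import structural_similarity

def evaluate(original: np.ndarray, generated: np.ndarray) -> tuple[float, float]:
    """SSIM (higher is better) and LPIPS (lower is better) for two
    HxWx3 uint8 images of the same size."""
    ssim = structural_similarity(original, generated,
                                 channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda im: (torch.from_numpy(im).permute(2, 0, 1).float()
                            / 127.5 - 1.0).unsqueeze(0)
    loss_fn = lpips.LPIPS(net="alex")
    lpips_val = loss_fn(to_tensor(original), to_tensor(generated)).item()
    return ssim, lpips_val
```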
Looking Ahead: Addressing Limitations
While InstructVTON represents a significant leap forward, the researchers acknowledge certain limitations. One primary concern is latency; complex multi-garment scenarios can take around a minute due to multiple calls to various intermediate AI models. Future work aims to address this by distilling the entire InstructVTON agent into a single, end-to-end model.
Another area for improvement is the granularity of body part segmentation, which currently limits the flexibility of very specific styling instructions (e.g., “rolling sleeves up to three-quarter length”). Enhancing this granularity and adding more advanced mask post-processing would enable even more precise style control. Finally, the current agents operate as open-loop planners, meaning an error in an early step can propagate through later ones. Future research will explore modeling the agents as Markov decision processes with reinforcement learning to mitigate error propagation and handle even more complex and uncommon try-on scenarios.
InstructVTON marks an exciting advancement in virtual try-on technology, making it more intuitive, flexible, and capable of handling complex styling requests through the power of natural language and intelligent automation. You can read the full research paper here.


