TLDR: InstructVTON is a virtual try-on (VTO) system that uses natural language instructions to enable complex, fine-grained styling control for single or multiple garments. Its AutoMasker automatically generates minimal, targeted masks from user instructions and image segmentation, eliminating the need for manual mask drawing. An agentic architecture plans multi-garment try-on sequences and handles challenging styling scenarios, sometimes via intermediate ‘dummy garments’. InstructVTON achieves higher mask efficiency than existing auto-masking approaches while maintaining high image generation quality, offering a more intuitive and flexible experience for virtual try-on applications.
Virtual try-on (VTO) technology has emerged as a powerful tool for online shopping and content creation, allowing users to visualize how garments would look on a person without a physical fitting. These systems traditionally rely on the user providing a precise binary mask to indicate where the garment should be placed on the human model. Creating such masks is challenging: it requires technical knowledge, and even well-drawn masks often cannot express complex styling requests, such as rolled-up sleeves or multiple garments layered in a specific arrangement.
A new system called InstructVTON addresses these limitations by offering an instruction-following, interactive virtual try-on experience. It allows for fine-grained and complex styling control, guided by natural language, for single or multiple garments. This innovation simplifies the end-user experience by removing the need for manually drawn masks and automating complex multi-round image generation scenarios.
How InstructVTON Works: The Brains Behind the Try-On
InstructVTON is built on an agentic system that leverages Vision Language Models (VLMs) and image segmentation models. At its core, it has two main components: a Top-level Agent and a VTO Agent.
The Top-level Agent acts as a planner. When a user wants to try on multiple garments with a specific style instruction (e.g., “try on the shirt tucked in, jacket open”), this agent organizes the task. It determines the correct order for trying on each garment and summarizes the relevant style instruction for each step. For instance, it knows to try on a shirt before a jacket if the jacket is meant to be layered on top.
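As a rough illustration, the planning step can be thought of as a single VLM call that returns an ordered list of (garment, instruction) steps. In the sketch below, `call_vlm` and the JSON schema are placeholders, not the paper’s actual interface:

```python
import json

def plan_tryon(garment_images: list[str], style_instruction: str) -> list[dict]:
    """Ask a VLM to order the garments and split the styling request per step."""
    prompt = (
        "You are a virtual try-on planner. Given the attached garments and the "
        f"styling request '{style_instruction}', return a JSON list of steps, "
        "each with 'garment' and 'instruction', ordered so inner layers "
        "(e.g., shirts) come before outer layers (e.g., jackets)."
    )
    # call_vlm is a stand-in for any vision-language-model API.
    return json.loads(call_vlm(prompt, images=garment_images))

# For "try on the shirt tucked in, jacket open", a plan might look like:
# [{"garment": "shirt.png", "instruction": "tucked in"},
#  {"garment": "jacket.png", "instruction": "worn with buttons open"}]
```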
The VTO Agent then executes this plan step by step. For each garment, it receives the current human model image, the target garment image, and the summarized style instruction. This is where the innovative AutoMasker comes into play.
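Conceptually, the VTO Agent runs a simple feed-forward loop, shown below as a minimal sketch; `load_image`, `auto_masker`, and `vto_model` are hypothetical names standing in for the system’s actual components:

```python
def run_vto(model_image, plan):
    """Execute a try-on plan one garment at a time, feeding each result
    back in as the input for the next step."""
    current = model_image
    for step in plan:
        garment = load_image(step["garment"])  # placeholder image loader
        # AutoMasker derives a minimal inpainting mask from segmentation
        # maps plus the per-step style instruction (see the next section).
        mask = auto_masker(current, garment, step["instruction"])
        # Any inpainting-based VTO backbone that accepts an (image,
        # garment, mask) triple can act as the generator here.
        current = vto_model(image=current, garment=garment, mask=mask)
    return current
```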
AutoMasker: Smart and Efficient Mask Generation
One of the most significant challenges in inpainting-based VTO is generating an effective mask. Traditional auto-masking solutions often create masks that cover more area than strictly necessary, potentially altering parts of the original image that should be preserved. InstructVTON’s AutoMasker instead takes a minimally invasive approach, aiming for high “mask efficiency”: it masks only the region the styling change requires, preserving as many pixels of the original image as possible.
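The paper’s exact formula isn’t reproduced here, but mask efficiency can be read as the fraction of pixels the mask leaves untouched, as in this minimal sketch:

```python
import numpy as np

def mask_efficiency(mask: np.ndarray) -> float:
    """Fraction of pixels the mask leaves untouched.

    `mask` is binary: 1 marks pixels the VTO model may repaint.
    Higher values mean more of the original image is preserved.
    """
    return 1.0 - float(mask.mean())

# A 200x140 masked region in a 512x384 image covers ~14% of the pixels.
demo = np.zeros((512, 384), dtype=np.float32)
demo[100:300, 120:260] = 1.0
print(f"mask efficiency: {mask_efficiency(demo):.2f}")  # -> 0.86
```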
The AutoMasker achieves this by using two types of segmentation models: a Body Parts Segmentation Map (BPSM) model, which identifies human body parts (like torso, arms, legs), and a Clothing Segmentation Map (CSM) model, which identifies existing clothing on the person. By combining information from these maps with the target garment type and the natural language style instruction, the AutoMasker intelligently determines the precise area to mask. For example, if the instruction is to try on an overcoat, it might mask the area between the legs to create a natural-looking drape. If the instruction is “wear the jacket with buttons open,” it can remove a stripe from the center of the masking area to achieve an open-chest style, preserving the garment underneath.
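Assuming the BPSM and CSM are integer label maps, the mask composition might look roughly like the following; the label IDs and the keyword check for “open” are illustrative choices, not the paper’s actual taxonomy or instruction parser:

```python
import numpy as np

# Hypothetical label IDs for the two segmentation maps.
TORSO, LEFT_ARM, RIGHT_ARM = 1, 2, 3  # BPSM body-part labels
UPPER_CLOTHES = 5                     # CSM clothing label

def build_mask(bpsm: np.ndarray, csm: np.ndarray, instruction: str) -> np.ndarray:
    """Union the body parts the new garment will cover with the pixels of
    the existing garment, then carve out regions the instruction preserves."""
    mask = np.isin(bpsm, [TORSO, LEFT_ARM, RIGHT_ARM]) | (csm == UPPER_CLOTHES)
    if "open" in instruction:  # e.g., "wear the jacket with buttons open"
        _, w = mask.shape
        mask[:, w // 2 - w // 10 : w // 2 + w // 10] = False  # keep a center stripe
    return mask.astype(np.uint8)
```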
In cases where a style instruction cannot be achieved in a single step (e.g., trying on a long-sleeve shirt with sleeves rolled up on a person already wearing a long-sleeve shirt), the VTO Agent employs a clever two-step approach. It might first use a “dummy garment” (like a tank top) to generate an intermediate image where the arms are exposed, and then apply the original target garment with the “sleeves rolled up” instruction to this intermediate image.
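In code, this fallback reduces to two chained calls to the single-garment routine; `vto_step` and `DUMMY_TANK_TOP` are placeholder names for illustration:

```python
def tryon_with_rolled_sleeves(model_image, shirt_image):
    # Step 1: a dummy garment (e.g., a tank top) yields an intermediate
    # image with exposed arms, removing the original long sleeves.
    intermediate = vto_step(model_image, garment=DUMMY_TANK_TOP,
                            instruction="")
    # Step 2: the target garment is applied to the intermediate image,
    # where "sleeves rolled up" is now achievable in one inpainting pass.
    return vto_step(intermediate, garment=shirt_image,
                    instruction="sleeves rolled up")
```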
Performance and Interoperability
InstructVTON has been shown to be interoperable with existing state-of-the-art VTO models without requiring retraining or fine-tuning. Experiments demonstrate that it consistently achieves higher mask efficiency compared to other leading models, meaning it preserves more of the original human model image while delivering comparable or improved image generation quality. This is measured using metrics like Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS).
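Both metrics are available off the shelf for anyone who wants to run similar comparisons; the snippet below uses scikit-image for SSIM and the `lpips` package for LPIPS, which is standard usage rather than the paper’s evaluation code:

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import structural_similarity

def evaluate(original: np.ndarray, generated: np.ndarray) -> tuple[float, float]:
    """SSIM (higher is better) and LPIPS (lower is better) for two
    HxWx3 uint8 images of the same size."""
    ssim = structural_similarity(original, generated,
                                 channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda im: (torch.from_numpy(im).permute(2, 0, 1).float()
                            / 127.5 - 1.0).unsqueeze(0)
    loss_fn = lpips.LPIPS(net="alex")
    lpips_val = loss_fn(to_tensor(original), to_tensor(generated)).item()
    return ssim, lpips_val
```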
Looking Ahead: Addressing Limitations
While InstructVTON represents a significant leap forward, the researchers acknowledge certain limitations. One primary concern is latency; complex multi-garment scenarios can take around a minute due to multiple calls to various intermediate AI models. Future work aims to address this by distilling the entire InstructVTON agent into a single, end-to-end model.
Another area for improvement is the granularity of body part segmentation, which currently limits the flexibility of very specific styling instructions (e.g., “rolling sleeves up to three-quarter length”). Enhancing this granularity and adding more advanced mask post-processing would enable even more precise style control. Finally, the current agents operate as open-loop planners, meaning an error in an early step can propagate through later ones. Future research will explore modeling the agents as Markov decision processes with reinforcement learning to mitigate error propagation and handle even more complex and uncommon try-on scenarios.
InstructVTON marks an exciting advancement in virtual try-on technology, making it more intuitive, flexible, and capable of handling complex styling requests through the power of natural language and intelligent automation. You can read the full research paper here.


