TL;DR: Voost is a new AI framework that uses a single diffusion transformer to perform both virtual try-on (putting clothes on a person) and virtual try-off (reconstructing the original garment from a dressed person). By training these tasks jointly, Voost improves garment-person interaction, leading to more realistic and accurate results across various poses and garment types. It also introduces inference-time techniques for better robustness and consistency, achieving state-of-the-art performance in both tasks.
The world of online fashion is constantly evolving, and virtual try-on technology is at the forefront of this transformation. Imagine being able to see how a garment looks on you without physically trying it on. This is the promise of virtual try-on (VTON), a generative AI task that creates a realistic image of a person wearing a target garment. However, accurately modeling how clothes fit and drape on a person, especially with different poses and appearances, has always been a significant challenge.
A new research paper introduces a groundbreaking framework called Voost, which aims to overcome these hurdles. Voost is a unified and scalable diffusion transformer that not only handles virtual try-on but also its inverse: virtual try-off. Virtual try-off is the task of reconstructing the original appearance of a garment from an image of a person wearing it. By learning both tasks simultaneously, Voost allows each garment-person pair to supervise both directions, significantly enhancing the AI’s understanding of garment-body relationships without needing separate networks, extra losses, or additional labels.
How Voost Works
At its core, Voost uses a single diffusion transformer, a powerful type of AI model, to learn both try-on and try-off. Unlike previous methods that might struggle with precise garment-person correspondence, Voost adopts a unique token-level concatenation structure. This means that the garment image and the person image are placed side-by-side and fed into a shared embedding space. This design allows the model to reason bidirectionally across both try-on and try-off scenarios using a common conditioning layout.
The framework is also highly scalable, supporting dynamic input layouts. This means it can handle diverse poses, aspect ratios, and spatial arrangements of images, making it robust for real-world applications. A special ‘task token’ tells the model whether to perform a try-on or try-off, and also specifies the garment category (e.g., upper, lower, full-body).
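The conditioning layout described above can be sketched roughly as follows. Note that the token dimension, embedding table, and function names here are illustrative assumptions for the purpose of showing the idea, not the paper's actual code.

```python
import numpy as np

# Illustrative sketch: garment and person images are encoded into token
# sequences and concatenated side by side in a shared embedding space,
# with a task token prepended that encodes both the direction
# (try-on vs. try-off) and the garment category. Sizes are assumed.

D = 64  # token dimension (assumed)
rng = np.random.default_rng(0)

# learned embeddings, one per (task, category) pair: 2 tasks x 3 categories
task_table = rng.standard_normal((2 * 3, D))

def build_sequence(garment_tokens, person_tokens, task_id, category_id):
    """Prepend a task token and concatenate both token sets into one sequence."""
    task_token = task_table[task_id * 3 + category_id][None, :]
    return np.concatenate([task_token, garment_tokens, person_tokens], axis=0)

garment = rng.standard_normal((16, D))  # 16 garment tokens
person = rng.standard_normal((16, D))   # 16 person tokens
seq = build_sequence(garment, person, task_id=0, category_id=1)  # try-on, lower-body
print(seq.shape)  # (33, 64)
```

Because both tasks share this single sequence layout, the same transformer weights can attend across the garment-person boundary in either direction.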
Smart Enhancements for Better Results
Voost introduces two clever techniques that refine its performance during inference (when the model is generating images):
- Attention Temperature Scaling: This technique helps the model adapt its focus when the input image resolution or mask size differs from what it was trained on. It ensures that the AI’s ‘attention’ remains sharp and relevant, especially when dealing with challenging layouts where the masked region might be small.
- Self-Corrective Sampling: This is a unique mechanism that leverages the model’s dual capability. During the image generation process, Voost can predict a dressed person image (try-on) and then use that prediction to perform a reverse try-off pass, reconstructing the original garment. By comparing this reconstructed garment to the actual conditioning garment, the model can iteratively refine its output, ensuring consistency and improving visual fidelity.
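To make the first technique concrete, here is a minimal sketch of temperature-scaled attention. The specific scaling rule below (rescaling pre-softmax logits by the square root of the ratio of log sequence lengths, a common length-extrapolation recipe) is an assumption for illustration; Voost's exact schedule may differ.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax along the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_attention(q, k, v, temperature=1.0):
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    # temperature > 1 sharpens the attention distribution, countering the
    # entropy growth that comes with more tokens at higher resolutions
    return softmax(temperature * logits) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 32)) for _ in range(3))

# sharpen attention when the inference-time token count exceeds training
train_tokens, test_tokens = 256, 1024
temperature = np.sqrt(np.log(test_tokens) / np.log(train_tokens))
out = scaled_attention(q, k, v, temperature)
print(round(temperature, 3), out.shape)  # 1.118 (8, 32)
```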
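The self-corrective loop has a simple structure: generate, reverse-reconstruct, compare against the conditioning garment, and correct. The toy below illustrates only that structure; the `toy_try_on` and `toy_try_off` functions are stand-ins, not Voost's diffusion transformer, and the update rule is a simplified invention for this example.

```python
import numpy as np

def toy_try_on(x, garment, t, total):
    # pretend denoising step: blend the sample toward the garment signal
    alpha = 1.0 / (total - t)
    return (1 - alpha) * x + alpha * garment

def toy_try_off(x):
    # pretend reverse pass: recover the garment signal from the sample
    return x

def self_corrective_sample(garment, steps=10, gamma=0.5):
    x = np.zeros_like(garment)  # start from a blank "noise" state
    for t in range(steps):
        x = toy_try_on(x, garment, t, steps)      # forward try-on step
        garment_hat = toy_try_off(x)              # reconstruct the garment
        x = x + gamma * (garment - garment_hat)   # consistency correction
    return x

g = np.array([1.0, 2.0, 3.0])
out = self_corrective_sample(g)
print(np.abs(out - g).max())  # 0.0
```

The key point is the feedback signal: because the same model can run both directions, the mismatch between the reconstructed and conditioning garment is available at every sampling step, for free.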
Impressive Performance
Extensive experiments show that Voost achieves state-of-the-art results on both virtual try-on and try-off benchmarks. It consistently outperforms existing methods in terms of alignment accuracy, visual fidelity, and generalization. A user study further confirmed its superiority, with participants consistently preferring Voost’s outputs for photorealism, garment detail, and garment structure.
The research also highlights the benefits of its joint training approach; the dual-task model significantly outperforms single-task models, indicating that learning both directions creates a more generalized understanding of garment-person interaction. Furthermore, the study found that fine-tuning only the attention modules within the transformer, rather than the entire model, achieved the best performance while significantly reducing training costs.
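Attention-only fine-tuning amounts to freezing every parameter whose name falls outside the attention modules. The sketch below shows the selection logic with made-up parameter names; in a real framework you would toggle the gradient flag on the matching parameters rather than filter strings.

```python
# Hypothetical parameter names for a two-block transformer (illustrative only)
param_names = [
    "blocks.0.attn.to_q.weight", "blocks.0.attn.to_kv.weight",
    "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc2.weight",
    "blocks.1.attn.to_q.weight", "blocks.1.mlp.fc1.weight",
]

def is_trainable(name, pattern=".attn."):
    # fine-tune only attention modules; everything else stays frozen
    return pattern in name

trainable = [n for n in param_names if is_trainable(n)]
frozen = [n for n in param_names if not is_trainable(n)]
print(len(trainable), len(frozen))  # 3 3
```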
Looking Ahead
While Voost marks a significant leap forward in virtual try-on and try-off technology, the researchers acknowledge areas for future improvement. Currently, precise control over garment fit can be ambiguous due to the lack of explicit structural or sizing information. Future work plans to incorporate additional cues like body measurements or garment metadata to enhance controllability. The strong foundation of Voost also makes it well-suited for extensions into video-based or 3D synthesis, promising even more immersive virtual fashion experiences.
For more technical details, you can read the full research paper here.


