
We present InstructVTON, an instruction-following interactive virtual try-on system that gives fine-grained and complex styling control over the generated result, guided by natural language, for single or multiple garments. A computationally efficient and scalable formulation of virtual try-on treats the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult, requires background knowledge, may be model-dependent, and is in some cases impossible with the masking-based approach (e.g., trying on a long-sleeve shirt with “sleeves rolled up” styling on a person wearing a long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated from user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the need for a precisely drawn mask, and by automating the execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models and achieves state-of-the-art results with styling control.
The AutoMasker of InstructVTON obtains body and clothing segmentation maps of the human model image. It then uses a VLM to choose the segments needed to satisfy the user-provided style instruction while minimizing the masked area, thereby maximizing preservation of the original human model image.
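The mask-assembly step described above can be illustrated with a minimal sketch. It assumes the segmentation map is a 2D grid of region labels and that the VLM's choice arrives as a list of label names; the label names, `build_mask`, and `masked_fraction` are hypothetical and are not taken from the InstructVTON implementation.

```python
from typing import Iterable, List

LabelMap = List[List[str]]    # one region label per pixel (assumed format)
BinaryMask = List[List[int]]  # 1 = repaint this pixel, 0 = keep the original


def build_mask(label_map: LabelMap, selected: Iterable[str]) -> BinaryMask:
    """Union of the selected segments, expressed as a binary mask."""
    chosen = set(selected)
    return [[1 if label in chosen else 0 for label in row] for row in label_map]


def masked_fraction(mask: BinaryMask) -> float:
    """Share of pixels that will be regenerated; smaller preserves more."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


# Toy 3x4 "image" with hypothetical region labels.
label_map = [
    ["background", "hair", "hair", "background"],
    ["left_sleeve", "torso", "torso", "right_sleeve"],
    ["background", "pants", "pants", "background"],
]

# Stand-in for the VLM's selection; in the real system this would be
# derived from the style instruction, not hard-coded.
selected = ["torso", "left_sleeve", "right_sleeve"]

mask = build_mask(label_map, selected)
print(mask)
print(f"masked area: {masked_fraction(mask):.2f}")
```

Selecting fewer segments lowers the masked fraction, which is the quantity the AutoMasker is described as minimizing while still satisfying the instruction.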
When presented with a style instruction that is hard to achieve with a single execution of inpainting-based virtual try-on (VTO), such as "sleeves rolled up" when the target garment is a long-sleeve shirt and the person in the human model image is also wearing a long-sleeve shirt, InstructVTON automatically adopts a two-step approach: it first generates an intermediate try-on image with a dummy short-sleeve garment, then generates the final try-on result with the target garment.
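The two-step flow can be sketched as a simple chaining of try-on calls. This is an illustration under stated assumptions, not the InstructVTON code: `run_try_on`, the `TryOn` callable type, and the string placeholders for images are all hypothetical, and the decision to use a dummy garment is passed in by the caller rather than inferred from the instruction.

```python
from typing import Callable, List, Optional

# Hypothetical try-on backend: (person_image, garment_image) -> result image.
# Strings stand in for images so the sketch runs without any model.
TryOn = Callable[[str, str], str]


def run_try_on(
    person: str,
    target_garment: str,
    try_on: TryOn,
    dummy_garment: Optional[str] = None,
) -> List[str]:
    """Run one or two rounds of try-on and return every generated image.

    If a dummy garment is given, it is tried on first so that the second
    round starts from a person image that no longer shows long sleeves.
    """
    results: List[str] = []
    current = person
    if dummy_garment is not None:
        current = try_on(current, dummy_garment)  # intermediate image
        results.append(current)
    current = try_on(current, target_garment)     # final image
    results.append(current)
    return results


def fake_try_on(person: str, garment: str) -> str:
    """Placeholder backend that only records what was composed."""
    return f"{person}+{garment}"


steps = run_try_on(
    "person",
    "long_sleeve_shirt",
    fake_try_on,
    dummy_garment="short_sleeve_dummy",
)
print(steps)
```

Keeping the backend behind a plain callable reflects the claim that the system is interoperable with existing virtual try-on models: any model that maps a person image and a garment image to a result could be substituted for `fake_try_on`.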
If you find our work useful, please cite our paper:
@misc{Instruct-VTON,
author = {Julien Han and Shuwen Qiu and Qi Li and Xingzi Xu and Kavosh Asadi and Amir Tavanaei and Karim Bouyarmane},
title = {InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On},
publisher = {ArXiv},
year = {2024},
primaryClass={cs.CL}
}