InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On

Amazon
UCLA, work done as an applied scientist intern at Amazon
Duke University, work done as an applied scientist intern at Amazon
The AutoMasker of InstructVTON generates an optimal mask for the virtual try-on task.
InstructVTON iteratively generates virtual try-on results with multiple garments.

Abstract

We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on casts the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult, requires background knowledge, may be model-dependent, and is in some cases impossible with the masking-based approach (e.g., trying on a long-sleeve shirt with "sleeves rolled up" styling on a person already wearing a long-sleeve shirt with the sleeves down, where the mask must necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the need for a precisely drawn mask, and by automating the execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models and achieves state-of-the-art results with styling control.

AutoMasker uses a VLM to determine the optimal masking area.

The AutoMasker of InstructVTON first obtains body and clothing segmentation maps of the human model image. It then uses a VLM to choose the segments necessary to satisfy the user-provided style instruction while minimizing the masked area, thereby maximizing preservation of the original human model image.
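The mask-assembly step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a per-pixel label map from an off-the-shelf segmentation model and a set of labels that a (hypothetical) VLM call has already selected, and simply unions those segments into a binary mask.

```python
import numpy as np

def build_optimal_mask(segmentation, required_labels):
    """Union only the segments the VLM selected for the style
    instruction, keeping the masked area minimal so the rest of
    the person image is preserved by the inpainting model.

    segmentation: (H, W) integer array of per-pixel part labels.
    required_labels: labels chosen by the (hypothetical) VLM step.
    Returns a binary (H, W) uint8 mask.
    """
    mask = np.isin(segmentation, list(required_labels))
    return mask.astype(np.uint8)

# Toy 4x4 label map: 0 = background, 1 = torso clothing, 2 = arms.
seg = np.array([[0, 1, 1, 0],
                [2, 1, 1, 2],
                [2, 1, 1, 2],
                [0, 0, 0, 0]])

# Suppose the VLM decided only the torso region needs repainting:
mask = build_optimal_mask(seg, {1})
```

Masking only the selected segments (here, 6 of 16 pixels) rather than a loose bounding box is what lets the inpainting model leave the face, background, and untouched garments intact.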

InstructVTON intelligently uses a dummy garment to satisfy hard-to-achieve style instructions.

When presented with a style instruction that is hard to achieve in a single execution of inpainting-based VTO, such as "sleeves rolled up" when the target garment is a long-sleeve shirt and the human model is also wearing a long-sleeve shirt, InstructVTON intelligently adopts a two-step approach: it first generates an intermediate try-on image with a short-sleeve dummy garment, then generates the final try-on result with the target garment.

BibTeX

If you find our work useful, please cite our paper:

@misc{Instruct-VTON,
      author = {Julien Han and Shuwen Qiu and Qi Li and Xingzi Xu and Kavosh Asadi and Amir Tavanaei and Karim Bouyarmane},
      title = {InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On},
      publisher = {arXiv},
      year = {2024},
      primaryClass = {cs.CL}
}