Description: Virtual Try-All with image conditioned diffusion
As online shopping grows, the ability of buyers to visualize products virtually in their own settings, a phenomenon we define as "Virtual Try-All", has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving the item's detail
We present Diffuse to Choose, a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details of a given reference item while ensuring accurate semantic manipulations of the given scene content. Our approach incorporates fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, together with a perceptual loss to further preserve the reference item's details.
We use a secondary U-Net encoder to inject fine-grained details into the diffusion process. This begins with masking the source image and then inserting the reference image within the masked area. The resulting pixel-level 'hint' is then adapted by a shallow CNN, which aligns it with the VAE output dimensions of the source image, before being added element-wise to it. Following this, a U-Net encoder processes the adapted hint, where at each scale of the U-Net, a FiLM module affinely aligns the skip-connec
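The first step of this pipeline, building the pixel-level hint, and the generic form of a FiLM layer can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `build_hint` and `film`, the `(top, left, height, width)` box format, and the nearest-neighbour resize are all hypothetical, and the shallow CNN adapter, the VAE, and the U-Net encoder are omitted. The `film` function shows only the standard per-channel affine form (scale and shift); how the modulation parameters are produced and applied in the actual model is not specified here.

```python
import numpy as np


def build_hint(source, reference, box):
    """Mask a rectangular region of `source` and paste `reference` into it.

    source:    (H, W, C) float array, the scene image
    reference: (h, w, C) float array, the product image (any size)
    box:       (top, left, height, width) of the masked region
               (hypothetical format chosen for this sketch)
    """
    top, left, height, width = box
    hint = source.copy()  # leave the caller's image untouched

    # Step 1: mask the region of the source image.
    hint[top:top + height, left:left + width] = 0.0

    # Step 2: resize the reference to the box with nearest-neighbour
    # sampling (a stand-in for a proper image resizer) and paste it in.
    rows = np.arange(height) * reference.shape[0] // height
    cols = np.arange(width) * reference.shape[1] // width
    hint[top:top + height, left:left + width] = reference[rows][:, cols]
    return hint


def film(features, gamma, beta):
    """Generic FiLM: per-channel affine modulation of a feature map.

    features:    (C, H, W) feature map
    gamma, beta: (C,) scale and shift, one pair per channel
    """
    return gamma[:, None, None] * features + beta[:, None, None]
```

In a full model the hint would next pass through the adapter and be added to the source latent; this sketch stops at the pixel level so that each step stays easy to inspect.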