re-ground.github.io - ReGround: Improving Textual and Spatial Grounding at No Cost


tl;dr: ReGround resolves the issue of description omission in GLIGEN [1] while still accurately reflecting the bounding boxes, without any extra cost.

When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that this bias can be significantly mitigated without sacrificing accuracy.
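The sequential flow described above can be illustrated with a toy sketch. This is not the authors' code: the function names and the scalar "attention" stand-ins are hypothetical, and the parallel variant is only one plausible way to decouple the two grounding signals, shown here to make the structural difference concrete.

```python
def gated_self_attention(x, boxes, gamma=1.0):
    # Toy stand-in: injects bounding-box (spatial) information, scaled by the gate.
    return x + gamma * boxes

def cross_attention(x, text):
    # Toy stand-in: injects text-prompt (textual) information.
    return x + text

def sequential_block(x, boxes, text):
    # GLIGEN-style wiring: gated self-attention feeds INTO cross-attention,
    # so cross-attention only ever sees features already altered by the boxes.
    return cross_attention(gated_self_attention(x, boxes), text)

def parallel_block(x, boxes, text):
    # Hypothetical rewiring: both attentions read the SAME input features,
    # so the spatial branch cannot overwrite what the textual branch attends to.
    spatial_update = gated_self_attention(x, boxes) - x
    textual_update = cross_attention(x, text) - x
    return x + spatial_update + textual_update
```

With these linear toy layers the two blocks compute the same number, but in a real U-Net the ordering matters: in the sequential block, cross-attention conditions on box-dominated features, which is the bias the paragraph above describes.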

(a) Images generated by GLIGEN [1] while varying the activation duration of gated self-attention (γ) in scheduled sampling (Sec. 5.1). The red words in the text prompt denote the words used as labels of the input bounding boxes. Note that to reflect the underlined description in the text prompt in the final image, γ must be decreased to 0.1, which compromises spatial grounding accuracy. (b) In contrast, our ReGround reflects the underlined phrase even when γ=1.0, thereby achieving high accuracy in both textual and spatial grounding.
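The scheduled-sampling knob γ in the caption can be read as "the fraction of denoising steps during which gated self-attention is active." A minimal sketch, assuming a step-indexed sampler (the function name and step count are illustrative, not from the paper):

```python
def use_gated_self_attention(step, num_steps, gamma):
    """True while the sampler is within the first `gamma` fraction of steps."""
    return step < gamma * num_steps

# With gamma = 0.1 over 50 denoising steps, only the first 5 steps apply
# spatial grounding; the remaining 45 run without gated self-attention,
# which is why a small gamma weakens bounding-box accuracy.
schedule = [use_gated_self_attention(t, num_steps=50, gamma=0.1) for t in range(50)]
```

At γ=1.0 the gated self-attention stays active for every step, which is the setting under which ReGround still reflects the full text prompt.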
