Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks.
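To make the blob interface concrete, below is a minimal PyTorch sketch of what such a masked cross-attention layer could look like, assuming each blob contributes a text embedding and a binary spatial mask over the visual feature map. The module name, tensor shapes, and the handling of locations covered by no blob are our illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MaskedCrossAttention(nn.Module):
    """Sketch: visual features attend to per-blob embeddings, with attention
    restricted so that each spatial location only attends to blobs whose mask
    covers it (shapes and names are assumptions, not the released code)."""

    def __init__(self, dim_visual, dim_blob, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim_visual // num_heads) ** -0.5
        self.to_q = nn.Linear(dim_visual, dim_visual, bias=False)
        self.to_k = nn.Linear(dim_blob, dim_visual, bias=False)
        self.to_v = nn.Linear(dim_blob, dim_visual, bias=False)
        self.to_out = nn.Linear(dim_visual, dim_visual)

    def forward(self, x, blob_emb, blob_masks):
        # x:          (B, N, dim_visual)  flattened visual features, N = H * W
        # blob_emb:   (B, K, dim_blob)    one embedding per blob
        # blob_masks: (B, K, N)           1 if blob k covers spatial location n
        B, N, _ = x.shape
        H = self.num_heads

        q = self.to_q(x).view(B, N, H, -1).transpose(1, 2)                     # (B, H, N, d)
        k = self.to_k(blob_emb).view(B, -1, H, q.shape[-1]).transpose(1, 2)    # (B, H, K, d)
        v = self.to_v(blob_emb).view(B, -1, H, q.shape[-1]).transpose(1, 2)    # (B, H, K, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale                          # (B, H, N, K)
        # Disallow attention from a location to blobs that do not cover it.
        mask = blob_masks.transpose(1, 2).unsqueeze(1)                         # (B, 1, N, K)
        attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = attn.softmax(dim=-1)
        # Locations covered by no blob become NaN after softmax; zero them out.
        attn = torch.nan_to_num(attn, nan=0.0)

        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)                     # (B, N, dim_visual)
        return self.to_out(out)
```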
We show visual examples of how blob-grounded generation (BlobGEN) can capture fine-grained details of real images, in particular those with irregular objects, large objects, and backgrounds. For example, the first three examples in (a) show that blob representations can capture "a person with waving hands outside the blob", and the last two examples in (a) show that blob representations can capture "a river with irregular shapes". Similarly, the fourth example in (b) shows that blob representations can capture "the Great Wall with a zigzag shape". As for large objects and background, the first two rows in (b) show that blob representations can capture the "sky" with a similar color and mood. The second row in (b) shows that blob representations can capture the large "pier" with a similar color, pattern, and shape. The third row in (b) shows that blob representations can capture the large "foggy grass" and its reflection in the water. The last two rows in (b) show that blob representations can capture the large mountains in the background.
To further demonstrate that blob representations capture more fine-grained information than bounding boxes, we provide qualitative results comparing blob-grounded generation (Ours) and box-grounded generation (GLIGEN [1]). We can clearly see that images generated from blob representations reconstruct fine-grained details of the reference image better than those generated from bounding boxes. For instance, in the first example, the laptop's orientation and the two books' orientation and color in our generated image closely follow the reference image, while GLIGEN's do not. In the second example, the car's color and orientation and the appearance and orientation of the grass and road from our method also follow the reference image more closely. In the third example, the two giraffes' orientation is better captured by blob representations than by bounding boxes, as a direct comparison with the reference shows. In the last example, the two girls' appearance and pose, the couch's color, and the orientation and appearance of the counter and chair in the background are also better captured by blob representations.
We visualize various image editing results obtained by manually changing a single blob representation (i.e., either its blob description or its blob parameters) while keeping all other blobs the same. Our method enables different local editing capabilities by changing only the corresponding blob descriptions. In particular, we can change an object's color, category, and texture while keeping the unedited regions mostly the same. Furthermore, our method also makes different object repositioning tasks easily achievable by manipulating only the corresponding blob parameters. For instance, we can move an object to different locations, change an object's orientation, and add or remove an object while keeping other regions nearly unchanged. Note that we do not apply any attention-map guidance or sample-blending tricks during sampling to enforce localization or disentanglement in the editing process. Thus, these results demonstrate that well-disentangled and modular behavior naturally emerges in our blob-grounded generation framework.
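As a concrete illustration of this editing workflow, the snippet below sketches how a single blob might be edited. It assumes each blob pairs a free-form text description with tilted-ellipse parameters in normalized coordinates; the exact format and the `blobgen.sample` call are hypothetical, not the released interface.

```python
# Illustrative only: each blob is assumed to be a text description plus
# tilted-ellipse parameters (center cx, cy; semi-axes a, b; rotation theta).
blobs = [
    {"desc": "a red vintage car parked on the street",
     "params": {"cx": 0.35, "cy": 0.60, "a": 0.25, "b": 0.12, "theta": 0.0}},
    {"desc": "a green grassy field under a clear sky",
     "params": {"cx": 0.50, "cy": 0.85, "a": 0.50, "b": 0.15, "theta": 0.0}},
]

# Local appearance edit: change only the blob description.
blobs[0]["desc"] = "a blue vintage car parked on the street"

# Object repositioning: shift and rotate only the blob parameters.
blobs[0]["params"].update({"cx": 0.65, "theta": 0.3})

# Object removal would simply drop the blob: del blobs[0]

# images = blobgen.sample(blobs, global_caption="a car parked on a street")  # hypothetical API
```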
Blob representations can be generated by LLMs from a global caption. To demonstrate blob control from LLMs, we consider four cases of how LLMs understand compositional prompts for correct visual planning: (a) swapping object names ("cat" <-> "car"), (b) changing relative positions ("left" <-> "right"), (c) changing object numbers ("three" <-> "four"), and (d) swapping object numbers ("one bench & two cats" <-> "two benches & one cat"). We can see that LLMs are able to capture the subtle differences when the prompts are modified in an "adversarial" manner. Besides, LLMs can generate diverse and feasible visual layouts from the same prompt, which BlobGEN can use to robustly synthesize correct images with high quality and diversity.
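The snippet below sketches what such an in-context prompt for LLM-based blob planning could look like; the instruction wording, the demo blobs, and the output format are illustrative assumptions rather than the exact prompt used in the paper.

```python
# Hypothetical prompt-building sketch for LLM-based blob planning.
INSTRUCTION = (
    "You are a visual planner. Given an image caption, output one blob per object as\n"
    "desc: <object description>; params: [cx, cy, a, b, theta] in normalized coordinates."
)

# A fixed in-context demonstration (illustrative values, not the paper's demos).
DEMOS = """\
Caption: a cat sitting on a bench
Blobs:
desc: a wooden park bench; params: [0.50, 0.70, 0.40, 0.15, 0.00]
desc: a tabby cat sitting upright; params: [0.45, 0.50, 0.12, 0.15, 0.00]
"""


def build_prompt(caption: str) -> str:
    # Concatenate the instruction, the fixed in-context demos, and the new caption.
    return f"{INSTRUCTION}\n\n{DEMOS}\nCaption: {caption}\nBlobs:"


print(build_prompt("two benches and one cat in a park"))
```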
We show the numerical and spatial reasoning results of our method and previous approaches in the following two figures, respectively. In addition, we visualize the layouts inferred by GPT-4 for both GLIGEN and our method. We can see that all methods without intermediate layout planning consistently fail at the task, including SDXL, which has a larger model size and a more advanced training procedure, and Attend-and-Excite, which uses explicit attention guidance. Compared with LayoutGPT [2] based on GLIGEN, our method not only generates images with better spatial and numerical correctness, but also generally achieves better visual quality with less of a "copy-and-paste" effect.
We qualitatively compare our method and LMD [3] on the NSR-1K benchmark. For a fair comparison, we use the same 8 fixed in-context demo examples (without retrieval) for GPT-4 to generate bounding boxes for LMD and blobs for our method, respectively. We observe that 1) for some complex examples, such as "a boat to the right of a fork" in (a) and "there are one car with three motorcycles in the image" in (b), LMD fails while our method works; 2) LMD consistently exhibits the "copy-and-paste" artifact in its generated images (since it modifies the diffusion sampling process for compositionality), such as for "a teddy bear to the left of a potted plant" in (a) and "a photo of four boats" in (b), while our generations look much more natural; and 3) the sampling time of LMD is around 3x that of our method.
[1] Y. Li et al. "GLIGEN: Open-Set Grounded Text-to-Image Generation." CVPR 2023.
[2] W. Feng et al. "LayoutGPT: Compositional Visual Planning and Generation with Large Language Models." NeurIPS 2023.
[3] L. Lian et al. "LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models." TMLR 2023.
Compositional Text-to-Image Generation with Dense Blob Representations
Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
@inproceedings{nie2024BlobGEN,
title={Compositional Text-to-Image Generation with Dense Blob Representations},
author={Nie, Weili and Liu, Sifei and Mardani, Morteza and Liu, Chao and Eckart, Benjamin and Vahdat, Arash},
booktitle={International Conference on Machine Learning (ICML)},
year={2024}
}