MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
Abstract
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at https://github.com/Jiahao000/MosaicFusion.
Community
Here is an ML-generated summary
Objective
The paper proposes MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation.
Insights
- Dividing the image into regions and generating multiple objects in one image is more effective than generating single objects.
- Allowing a certain overlap between regions results in a smoother transition and better image quality.
- Appending a category definition to the prompt reduces ambiguity and improves accuracy.
- Aggregating attention maps across all layers and time steps captures both coarse shapes and fine details.
- Higher-quality diffusion models like Stable Diffusion produce better instance segmentation performance.
- MosaicFusion consistently improves various detection architectures and is complementary to other data augmentation techniques.
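The overlapping-region insight above can be made concrete with a small sketch. The helper below splits a canvas into a grid of regions that overlap by a few pixels so neighbouring generations blend smoothly; the grid shape and overlap size are illustrative assumptions, not the paper's actual values.

```python
# Illustrative sketch: split an h x w canvas into a rows x cols grid of
# overlapping regions. `overlap` (in pixels) is an assumed parameter,
# not taken from the paper.
def grid_regions(h, w, rows=2, cols=2, overlap=8):
    """Return (top, left, height, width) boxes covering the canvas,
    where neighbouring boxes overlap for smoother transitions."""
    boxes = []
    bh, bw = h // rows, w // cols
    for r in range(rows):
        for c in range(cols):
            top = max(r * bh - overlap // 2, 0)
            left = max(c * bw - overlap // 2, 0)
            bottom = min((r + 1) * bh + overlap // 2, h)
            right = min((c + 1) * bw + overlap // 2, w)
            boxes.append((top, left, bottom - top, right - left))
    return boxes
```

Each box can then be paired with its own text prompt and diffused in parallel; the overlap ensures no hard seams appear between regions.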
Implementation
- Divide the image canvas into multiple regions and define text prompts for each region specifying the object to generate.
- Map the image canvas from pixel space to latent space according to the scaling factor between the two spaces in the diffusion model (e.g., 8× for Stable Diffusion).
- Run the diffusion process on each latent region in parallel using a shared noise prediction model. Initialize with the same noise but condition each region on the corresponding text prompt.
- Aggregate the cross-attention maps for each token across layers and time steps by upsampling and averaging.
- Threshold the aggregated attention maps and refine the masks with an edge-aware algorithm such as the bilateral solver.
- Filter out low-quality masks via connected-component analysis, keeping only masks with a single connected component.
- Expand the filtered region masks to the full canvas size to produce the final instance masks.
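The attention-aggregation and mask-extraction steps above can be sketched as follows. This is a minimal illustration, assuming the cross-attention maps for one object token have already been collected from the diffusion model; the function names, shapes, and thresholds are illustrative, not the authors' implementation.

```python
# Sketch of mask extraction from cross-attention maps (illustrative only).
import numpy as np
from scipy import ndimage

def aggregate_attention(attn_maps, out_size=(64, 64)):
    """Upsample each (h, w) attention map to `out_size` and average
    over all layers and diffusion time steps."""
    acc = np.zeros(out_size, dtype=np.float64)
    for a in attn_maps:  # one map per (layer, time step)
        zoom = (out_size[0] / a.shape[0], out_size[1] / a.shape[1])
        acc += ndimage.zoom(a, zoom, order=1)  # bilinear upsampling
    return acc / len(attn_maps)

def attention_to_mask(agg, thresh_ratio=0.5):
    """Threshold the aggregated map, then keep the mask only if it
    forms a single connected component (else discard as low quality)."""
    mask = agg >= thresh_ratio * agg.max()
    _, n_components = ndimage.label(mask)
    return mask if n_components == 1 else None
```

A surviving mask would then be refined with an edge-aware method and pasted back at its region's location on the full canvas to form the final instance annotation.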
Results
MosaicFusion significantly boosts performance on long-tailed and open-vocabulary instance segmentation benchmarks, especially for rare and novel categories.