MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
Abstract
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at https://github.com/Jiahao000/MosaicFusion.
Community
Here is an ML-generated summary
Objective
The paper proposes MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation.
Insights
- Dividing the image into regions and generating multiple objects in one image is more effective than generating single objects.
- Allowing a certain overlap between regions results in a smoother transition and better image quality.
- Appending a category definition to the prompt reduces ambiguity and improves accuracy.
- Aggregating attention maps across all layers and time steps captures both coarse shapes and fine details.
- Higher-quality diffusion models like Stable Diffusion produce better instance segmentation performance.
- MosaicFusion consistently improves various detection architectures and is complementary to other data augmentation techniques.
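The overlapping-region insight above can be made concrete with a small sketch. The helper below splits a canvas into a grid of regions that overlap by a few pixels so neighbouring generations blend smoothly; the grid shape and overlap size are illustrative assumptions, not the paper's actual values.

```python
# Illustrative sketch: split an h x w canvas into a rows x cols grid of
# overlapping regions. `overlap` (in pixels) is an assumed parameter,
# not taken from the paper.
def grid_regions(h, w, rows=2, cols=2, overlap=8):
    """Return (top, left, height, width) boxes covering the canvas,
    where neighbouring boxes overlap for smoother transitions."""
    boxes = []
    bh, bw = h // rows, w // cols
    for r in range(rows):
        for c in range(cols):
            top = max(r * bh - overlap // 2, 0)
            left = max(c * bw - overlap // 2, 0)
            bottom = min((r + 1) * bh + overlap // 2, h)
            right = min((c + 1) * bw + overlap // 2, w)
            boxes.append((top, left, bottom - top, right - left))
    return boxes
```

Each box can then be paired with its own text prompt and diffused in parallel; the overlap ensures no hard seams appear between regions.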
Implementation
- Divide the image canvas into multiple regions and define text prompts for each region specifying the object to generate.
- Map the image canvas from pixel space to latent space according to the scaling factor between the two spaces in the diffusion model (e.g., 8× for Stable Diffusion).
- Run the diffusion process on each latent region in parallel using a shared noise prediction model. Initialize with the same noise but condition each region on the corresponding text prompt.
- Aggregate the cross-attention maps for each token across layers and time steps by upsampling and averaging.
- Threshold the aggregated attention maps and refine the masks with an edge-aware algorithm such as the bilateral solver.
- Filter out low-quality masks via connected-component analysis, keeping only masks with a single connected component.
- Expand the filtered region masks to the full canvas size to produce the final instance masks.
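The attention-aggregation and mask-extraction steps above can be sketched as follows. This is a minimal illustration, assuming the cross-attention maps for one object token have already been collected from the diffusion model; the function names, shapes, and thresholds are illustrative, not the authors' implementation.

```python
# Sketch of mask extraction from cross-attention maps (illustrative only).
import numpy as np
from scipy import ndimage

def aggregate_attention(attn_maps, out_size=(64, 64)):
    """Upsample each (h, w) attention map to `out_size` and average
    over all layers and diffusion time steps."""
    acc = np.zeros(out_size, dtype=np.float64)
    for a in attn_maps:  # one map per (layer, time step)
        zoom = (out_size[0] / a.shape[0], out_size[1] / a.shape[1])
        acc += ndimage.zoom(a, zoom, order=1)  # bilinear upsampling
    return acc / len(attn_maps)

def attention_to_mask(agg, thresh_ratio=0.5):
    """Threshold the aggregated map, then keep the mask only if it
    forms a single connected component (else discard as low quality)."""
    mask = agg >= thresh_ratio * agg.max()
    _, n_components = ndimage.label(mask)
    return mask if n_components == 1 else None
```

A surviving mask would then be refined with an edge-aware method and pasted back at its region's location on the full canvas to form the final instance annotation.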
Results
MosaicFusion significantly boosts performance on long-tailed and open-vocabulary instance segmentation benchmarks, especially for rare and novel categories.