Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Abstract
Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images, which creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts benchmark and on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
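As a concrete illustration of what quality-tuning amounts to in practice, here is a minimal sketch in PyTorch with diffusers: a pre-trained latent diffusion checkpoint is fine-tuned with the standard noise-prediction loss on a tiny, hand-curated set of image-caption pairs. This is not the authors' implementation; the public Stable Diffusion checkpoint, the `quality_set/` folder layout, the learning rate, and the epoch count are all illustrative assumptions (the paper does not release its model or fine-tuning hyperparameters).

```python
# Minimal quality-tuning sketch (NOT the authors' released code or hyperparameters).
# Assumptions: a Stable Diffusion checkpoint stands in for the pre-trained LDM, and
# "quality_set/" holds a few thousand curated .jpg images with matching .txt captions.
import os
import torch
import torch.nn.functional as F
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTokenizer, CLIPTextModel

MODEL = "runwayml/stable-diffusion-v1-5"  # stand-in checkpoint; Emu itself is not released
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = CLIPTokenizer.from_pretrained(MODEL, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL, subfolder="text_encoder").to(device).eval()
vae = AutoencoderKL.from_pretrained(MODEL, subfolder="vae").to(device).eval()
unet = UNet2DConditionModel.from_pretrained(MODEL, subfolder="unet").to(device).train()
noise_scheduler = DDPMScheduler.from_pretrained(MODEL, subfolder="scheduler")

class QualitySet(Dataset):
    """A small, hand-curated set of highly aesthetic image-caption pairs."""
    def __init__(self, root):
        self.root = root
        self.items = sorted(f for f in os.listdir(root) if f.endswith(".jpg"))
        self.tf = transforms.Compose([
            transforms.Resize(512), transforms.CenterCrop(512),
            transforms.ToTensor(), transforms.Normalize([0.5], [0.5]),
        ])

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        name = self.items[i]
        image = self.tf(Image.open(os.path.join(self.root, name)).convert("RGB"))
        caption = open(os.path.join(self.root, name[:-4] + ".txt")).read().strip()
        return image, caption

loader = DataLoader(QualitySet("quality_set"), batch_size=4, shuffle=True)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)  # small LR: an illustrative guess

for epoch in range(10):  # short schedule to avoid overfitting the tiny curated set
    for images, captions in loader:
        images = images.to(device)
        with torch.no_grad():
            # Encode images into VAE latents and captions into CLIP text embeddings.
            latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
            ids = tokenizer(list(captions), padding="max_length", truncation=True,
                            max_length=tokenizer.model_max_length,
                            return_tensors="pt").input_ids.to(device)
            text_emb = text_encoder(ids)[0]
        # Same noise-prediction objective as pre-training, only the data changes.
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=device)
        noisy = noise_scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        loss = F.mse_loss(pred, noise)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point the paper makes is that the data curation, not the objective, does the work: the training loss is unchanged from pre-training, and only a few thousand exceptionally aesthetic images are used.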
Community
Do you plan to release the high-quality image dataset used for quality-tuning?
It would be very helpful to the vision community.
Here is an AI-generated summary
Objective
The paper proposes quality-tuning, fine-tuning a pre-trained text-to-image model on a small set of exceptionally high-quality images, to align the model to generate highly aesthetic images.
The key insight is that fine-tuning on just a few thousand carefully selected, high-quality images can significantly improve the visual appeal of generated images without compromising generality.
Insights
- Fine-tuning on just a few thousand carefully selected, high-quality images can significantly improve visual appeal.
- Image quality is far more important than quantity for the fine-tuning data.
- Following basic principles of photography leads to more aesthetic images across different styles.
- Quality-tuning improves visual appeal without sacrificing generality of concepts or faithfulness.
- Quality-tuning is effective for various architectures like pixel diffusion and masked transformers.
- Quality-tuning is analogous to instruction tuning for language models: both require high-quality data.
Results
The resulting quality-tuned model, Emu, significantly outperforms both its pre-trained-only counterpart (82.9% win rate) and the state-of-the-art SDXLv1.0 (preferred 68.4% and 71.3% of the time) in visual appeal.
Anyone implementing the channel increase for VAEs?
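For reference, the change the paper describes for its autoencoder (raising the number of latent channels from the usual 4 to 16) can be prototyped with diffusers' `AutoencoderKL`, which takes a `latent_channels` argument. Below is a minimal sketch: the block widths are placeholder values rather than the paper's configuration, the resulting VAE starts from random weights, and the companion U-Net would also need 16 input/output channels.

```python
# Sketch of a 16-latent-channel VAE (illustrative; not the paper's exact architecture).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL(
    in_channels=3,
    out_channels=3,
    latent_channels=16,                       # the "channel increase": 4 -> 16
    down_block_types=("DownEncoderBlock2D",) * 4,
    up_block_types=("UpDecoderBlock2D",) * 4,
    block_out_channels=(128, 256, 512, 512),  # placeholder widths, not from the paper
    layers_per_block=2,
)  # randomly initialized: pre-trained 4-channel VAE weights do not load into this shape

x = torch.randn(1, 3, 512, 512)               # dummy 512x512 RGB batch
latents = vae.encode(x).latent_dist.sample()  # -> (1, 16, 64, 64)
recon = vae.decode(latents).sample            # -> (1, 3, 512, 512)
print(latents.shape, recon.shape)
```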
Another paper that could be just a blog post. Not sure where the novelty is.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (2023)
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (2023)
- Dense Text-to-Image Generation with Attention Modulation (2023)
- PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models (2023)
- The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings (2023)
Fine-tuning text-to-image models on curated high-quality images has been done by the community for more than a year. Furthermore, the paper does not disclose any of the fine-tuning hyperparameters.
Emu: The Secret to Generating Stunning Images with Small Data Sets
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/