	---
	license: openrail++
	library_name: diffusers
	tags:
	- text-to-image
	- diffusers-training
	- diffusers
	- stable-diffusion-xl
	- stable-diffusion-xl-diffusers
	base_model: stabilityai/stable-diffusion-xl-base-1.0
	---

	# Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

	<div align="center">
	<img src="https://github.com/mapo-t2i/mapo/blob/main/assets/mapo_overview.png?raw=true" width=750/>
	</div><br>

	We propose MaPO, a reference-free, sample-efficient, memory-friendly alignment technique for text-to-image diffusion models. For more details on the technique, please refer to our paper [here](https://arxiv.org/abs/2406.06424).


	## Developed by

	* Jiwoo Hong<sup>*</sup> (KAIST AI)
	* Sayak Paul<sup>*</sup> (Hugging Face)
	* Noah Lee (KAIST AI)
	* Kashif Rasul (Hugging Face)
	* James Thorne (KAIST AI)
	* Jongheon Jeong (Korea University)

	## Dataset

	This model was fine-tuned from [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) on the [yuvalkirstain/pickapic_v2](https://huggingface.co/datasets/yuvalkirstain/pickapic_v2) dataset.

	## Training Code

	Refer to our code repository [here](https://github.com/mapo-t2i/mapo).

	## Qualitative Comparison

	<div align="center">
	<img src="assets/comparison.png" width=750/>
	</div>


	## Results

	Below we report some quantitative metrics and use them to compare MaPO to existing models:

	<style>
	table {
	width: 100%;
	border-collapse: collapse;
	}
	th, td {
	border: 1px solid #000;
	padding: 8px;
	text-align: center;
	}
	th {
	background-color: #808080;
	}
	.ours {
	font-style: italic;
	}
	</style>

	<table>
	<caption>Average score for Aesthetic, HPS v2.1, and PickScore</caption>
	<thead>
	<tr>
	<th></th>
	<th>Aesthetic</th>
	<th>HPS v2.1</th>
	<th>PickScore</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>SDXL</td>
	<td>6.03</td>
	<td>30.0</td>
	<td>22.4</td>
	</tr>
	<tr>
	<td>SFT<sub>Chosen</sub></td>
	<td>5.95</td>
	<td>29.6</td>
	<td>22.0</td>
	</tr>
	<tr>
	<td>Diffusion-DPO</td>
	<td>6.03</td>
	<td>31.1</td>
	<td><b>22.7</b></td>
	</tr>
	<tr>
	<td><b>MaPO (Ours)</b></td>
	<td><b>6.17</b></td>
	<td><b>31.2</b></td>
	<td>22.5</td>
	</tr>
	</tbody>
	</table>


	We evaluated this checkpoint on the public Imgsys benchmark. At the time of writing, MaPO ranked 7th on the leaderboard, outperforming or matching 21 of 25 state-of-the-art text-to-image diffusion models, while Diffusion-DPO ranked 20th. MaPO also required 14.5% less wall-clock training time than Diffusion-DPO when adapting to Pick-a-Pic v2. We thank the Imgsys team for helping us obtain the human preference data.

	<div align="center">
	<img src="https://mapo-t2i.github.io/static/images/imgsys.png" width=750/>
	</div>

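	The table below compares the computational costs of Diffusion-DPO and MaPO. MaPO trains in less time, uses less GPU memory, and supports a larger maximum batch size, which makes it a more efficient alternative for alignment fine-tuning of diffusion models: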

	<table>
	<caption>Computational costs of Diffusion-DPO and MaPO</caption>
	<thead>
	<tr>
	<th></th>
	<th>Diffusion-DPO</th>
	<th>MaPO <span class="ours">(Ours)</span></th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>Time (↓)</td>
	<td>63.5</td>
	<td><b>54.3 (-14.5%)</b></td>
	</tr>
	<tr>
	<td>GPU Mem. (↓)</td>
	<td>55.9</td>
	<td><b>46.1 (-17.5%)</b></td>
	</tr>
	<tr>
	<td>Max Batch (↑)</td>
	<td>4</td>
	<td><b>16 (×4)</b></td>
	</tr>
	</tbody>
	</table>
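The relative differences in the table can be re-derived from its raw values. The snippet below is a minimal sketch of that arithmetic; the helper name `relative_change` and the `costs` dictionary are illustrative and not part of the MaPO codebase, and the numbers are copied from the table as reported.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (Diffusion-DPO, MaPO) values copied from the table above.
costs = {
    "Time": (63.5, 54.3),
    "GPU Mem.": (55.9, 46.1),
}

for name, (dpo, mapo) in costs.items():
    print(f"{name}: {relative_change(dpo, mapo):+.1f}%")
# Time: -14.5%
# GPU Mem.: -17.5%

# Maximum batch size is reported as a ratio rather than a percentage.
batch_ratio = 16 / 4
print(f"Max Batch: x{batch_ratio:.0f}")
# Max Batch: x4
```

A negative value means MaPO needs less of that resource than Diffusion-DPO.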


	## Inference

	```python
	import torch
	from diffusers import AutoencoderKL, DiffusionPipeline, UNet2DConditionModel

	sdxl_id = "stabilityai/stable-diffusion-xl-base-1.0"  # base SDXL pipeline
	vae_id = "madebyollin/sdxl-vae-fp16-fix"  # VAE loaded in fp16
	unet_id = "mapo-t2i/mapo-beta"  # MaPO fine-tuned UNet

	# Load the VAE and the MaPO UNet in half precision, then swap both into the SDXL pipeline.
	vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16)
	unet = UNet2DConditionModel.from_pretrained(unet_id, torch_dtype=torch.float16)
	pipeline = DiffusionPipeline.from_pretrained(sdxl_id, vae=vae, unet=unet, torch_dtype=torch.float16).to("cuda")

	# Generate a single image; `.images` holds the generated images for the prompt.
	prompt = "An abstract portrait consisting of bold, flowing brushstrokes against a neutral background."
	image = pipeline(prompt=prompt, num_inference_steps=30).images[0]
	```

	For qualitative results, please visit our [project website](https://mapo-t2i.github.io/).

	## Citation

	```bibtex
	@misc{hong2024marginaware,
	title={Margin-aware Preference Optimization for Aligning Diffusion Models without Reference},
	author={Jiwoo Hong and Sayak Paul and Noah Lee and Kashif Rasul and James Thorne and Jongheon Jeong},
	year={2024},
	eprint={2406.06424},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```