Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

This is the official implementation of DiGIT (GitHub), accepted at NeurIPS 2024.

Overview

We present DiGIT, an autoregressive generative model that performs next-token prediction in an abstract latent space derived from self-supervised learning (SSL) models. Applying K-Means clustering to the hidden states of the DINOv2 model yields a novel discrete tokenizer. This approach substantially improves image generation on the ImageNet dataset, achieving an FID of 4.59 for class-unconditional generation and 3.39 for class-conditional generation. The model also improves image understanding, attaining a linear-probe top-1 accuracy of 80.3%.
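As a rough illustration of the tokenization idea described above, the following sketch maps feature vectors to the index of their nearest K-Means centroid and then forms next-token prediction pairs. It is a minimal sketch under stated assumptions, not the repository's implementation: the function name `assign_tokens`, the toy 2-D centroids, and the toy features are all hypothetical stand-ins for DINOv2 hidden states and a learned codebook.

```python
import numpy as np

def assign_tokens(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Return, for each feature row, the index of the nearest centroid (squared L2)."""
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2; the ||f||^2 term is constant per row,
    # so it can be dropped when taking the argmin over centroids.
    dots = features @ centroids.T              # shape (N, K)
    c_norm = (centroids ** 2).sum(axis=1)      # shape (K,)
    return np.argmin(c_norm[None, :] - 2.0 * dots, axis=1)

# Toy codebook (K=4, D=2) and toy "patch features"; real features would come
# from an SSL encoder, which is not loaded here.
centroids = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]])
features = np.array([[9.5, 0.2], [0.1, 9.7], [0.3, -0.2], [10.2, 9.9]])

tokens = assign_tokens(features, centroids)
print(tokens.tolist())  # [2, 1, 0, 3]

# Next-token prediction: the model sees tokens[:-1] and is trained to predict tokens[1:].
inputs, targets = tokens[:-1], tokens[1:]
```

Dropping the per-row `||f||^2` term avoids materializing an (N, K, D) difference tensor, which matters when the codebook has thousands of entries.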

Experimental Results

Linear-Probe Accuracy on ImageNet

Methods	# Tokens	Features	# Params	Top-1 Acc. $\uparrow$
iGPT-L	32 $\times$ 32	1536	1362M	60.3
iGPT-XL	64 $\times$ 64	3072	6801M	68.7
VIM+VQGAN	32 $\times$ 32	1024	650M	61.8
VIM+dVAE	32 $\times$ 32	1024	650M	63.8
VIM+ViT-VQGAN	32 $\times$ 32	1024	650M	65.1
VIM+ViT-VQGAN	32 $\times$ 32	2048	1697M	73.2
AIM	16 $\times$ 16	1536	0.6B	70.5
DiGIT (Ours)	16 $\times$ 16	1024	219M	71.7
DiGIT (Ours)	16 $\times$ 16	1536	732M	80.3

Class-Unconditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

Type	Methods	# Params	# Epochs	FID $\downarrow$	IS $\uparrow$
GAN	BigGAN	70M	-	38.6	24.70
Diff.	LDM	395M	-	39.1	22.83
Diff.	ADM	554M	-	26.2	39.70
MIM	MAGE	200M	1600	11.1	81.17
MIM	MAGE	463M	1600	9.10	105.1
MIM	MaskGIT	227M	300	20.7	42.08
MIM	DiGIT (+MaskGIT)	219M	200	9.04	75.04
AR	VQGAN	214M	200	24.38	30.93
AR	DiGIT (+VQGAN)	219M	400	9.13	73.85
AR	DiGIT (+VQGAN)	732M	200	4.59	141.29

Class-Conditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

Type	Methods	# Params	# Epochs	FID $\downarrow$	IS $\uparrow$
GAN	BigGAN	160M	-	6.95	198.2
Diff.	ADM	554M	-	10.94	101.0
Diff.	LDM-4	400M	-	10.56	103.5
Diff.	DiT-XL/2	675M	-	9.62	121.50
Diff.	L-DiT-7B	7B	-	6.09	153.32
MIM	CQR-Trans	371M	300	5.45	172.6
MIM+AR	VAR	310M	200	4.64	-
MIM+AR	VAR	310M	200	3.60*	257.5*
MIM+AR	VAR	600M	250	2.95*	306.1*
MIM	MAGVIT-v2	307M	1080	3.65	200.5
AR	VQVAE-2	13.5B	-	31.11	45
AR	RQ-Trans	480M	-	15.72	86.8
AR	RQ-Trans	3.8B	-	7.55	134.0
AR	ViTVQGAN	650M	360	11.20	97.2
AR	ViTVQGAN	1.7B	360	5.3	149.9
MIM	MaskGIT	227M	300	6.18	182.1
MIM	DiGIT (+MaskGIT)	219M	200	4.62	146.19
AR	VQGAN	227M	300	18.65	80.4
AR	DiGIT (+VQGAN)	219M	400	4.79	142.87
AR	DiGIT (+VQGAN)	732M	200	3.39	205.96

*: VAR is trained with classifier-free guidance while all the other models are not.

Checkpoints

The K-Means npy file and model checkpoints can be downloaded from:

Model	Link
HF weights🤗	Huggingface

The base model uses DINOv2-base and the large model uses DINOv2-large. The VQGAN is the same one used by MAGE.

DiGIT
├── data/
│   └── ILSVRC2012
│       ├── dinov2_base_short_224_l3
│       │   └── km_8k.npy
│       └── dinov2_large_short_224_l3
│           └── km_16k.npy
├── outputs/
│   ├── base_8k_stage1
│   └── ...
└── models/
    ├── vqgan_jax_strongaug.ckpt
    ├── dinov2_vitb14_reg4_pretrain.pth
    └── dinov2_vitl14_reg4_pretrain.pth
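Before running anything, it may help to confirm that the downloaded files sit where the layout above expects them. The helper below is a hypothetical convenience sketch, not part of the repository: the name `missing_files` is invented for illustration, and the paths are copied from the directory tree above, relative to the `DiGIT` root.

```python
from pathlib import Path

# Paths taken from the directory layout above, relative to the DiGIT root.
EXPECTED_FILES = [
    "data/ILSVRC2012/dinov2_base_short_224_l3/km_8k.npy",
    "data/ILSVRC2012/dinov2_large_short_224_l3/km_16k.npy",
    "models/vqgan_jax_strongaug.ckpt",
    "models/dinov2_vitb14_reg4_pretrain.pth",
    "models/dinov2_vitl14_reg4_pretrain.pth",
]

def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("DiGIT")` returns an empty list when every expected file is in place; otherwise it lists the relative paths that still need to be downloaded or moved.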

The training and inference code is available in our GitHub repo: https://github.com/DAMO-NLP-SG/DiGIT

Citation

If you find our project useful, please star our repo and cite our work as follows.

@misc{zhu2024stabilize,
    title={Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective},
    author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
    year={2024},
    eprint={2410.12490},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
