Diffusion Models Beat GANs on Image Classification
Abstract
While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.
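As a concrete illustration of the recipe the abstract describes, here is a minimal sketch in PyTorch: noise an image to a chosen diffusion time step, run the denoising U-Net once, capture an intermediate feature map with a forward hook, and pool it into a vector for a classification head. The handles `unet`, `alphas_cumprod`, and `block`, and the feature width, are assumptions standing in for a pretrained Guided-Diffusion-style model, not the paper's exact code.

```python
import torch
import torch.nn as nn

def extract_features(unet, alphas_cumprod, x0, block, t):
    """Noise x0 to step t, denoise once, and capture `block`'s activation.

    `unet` is a (hypothetical) pretrained diffusion U-Net and
    `alphas_cumprod` its noise schedule; `block` is one of the U-Net's
    intermediate modules. The pair (block, t) are the knobs being swept.
    """
    captured = {}
    hook = block.register_forward_hook(
        lambda module, inputs, output: captured.update(feat=output)
    )
    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    with torch.no_grad():
        unet(x_t, torch.full((x0.shape[0],), t, device=x0.device))
    hook.remove()
    # Global average pooling: (B, C, H, W) -> (B, C)
    return captured["feat"].mean(dim=(2, 3))

# A simple linear probe on the frozen, pooled activations; the feature
# width and class count (1024, 1000) are placeholders.
head = nn.Linear(1024, 1000)
```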
Community
Diffusion models can serve as unified representation learners that address both the generative and discriminative paradigms in a single pre-training stage: their intermediate feature maps can be reused for downstream tasks such as classification. A diffusion model adds Gaussian noise to an image over a sequence of forward steps and learns to denoise, recovering the underlying image from noise. The paper uses Guided Diffusion, whose U-Net combines residual blocks, multi-head self-attention, scale-shift normalization, and BigGAN-style residual blocks for upsampling and downsampling. Features (activations) are extracted at a chosen U-Net block and diffusion time step. On ImageNet classification the resulting features beat BigBiGAN (but not MAGE); on fine-grained visual classification (FGVC), SimCLR and SwAV do better. Feature representations are compared using centered kernel alignment (CKA). The appendix covers implementation details for the convolutional and attention classification heads, plus further ablations. From the University of Maryland.
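For the CKA comparison mentioned above, linear centered kernel alignment can be computed directly from two feature matrices. A minimal sketch, assuming `X` and `Y` hold features for the same n examples (one row per example) from the two models being compared:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2)."""
    # Center each feature dimension across examples.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # CKA = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = (X.T @ Y).norm(p="fro") ** 2
    return cross / ((X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro"))
```

Values near 1 indicate highly similar representations; values near 0 indicate dissimilar ones.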
Links: PapersWithCode