arxiv:2410.09347

Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Published on Oct 12 · Submitted by ChenDRAG on Oct 18
Abstract

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.

Community



(TL;DR) We propose CCA, a finetuning technique for AR visual models that lets them generate high-quality images without CFG, cutting sampling costs by half. CCA and CFG share the same theoretical foundations and thus similar features, though CCA is inspired by LLM alignment rather than guided sampling.
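The halved sampling cost follows from a simple accounting: CFG runs two forward passes per sampling step (one conditional, one unconditional) and mixes the logits, while a guidance-free model runs only the conditional pass. A minimal illustration, where `model` is a hypothetical callable returning per-token logits:

```python
def cfg_logits(model, tokens, cond, uncond, scale=1.5):
    """Classifier-Free Guidance: two forward passes per sampling step."""
    logits_cond = model(tokens, cond)      # pass 1: with the condition
    logits_uncond = model(tokens, uncond)  # pass 2: condition dropped
    # Extrapolate away from the unconditional prediction.
    return [u + scale * (c - u) for c, u in zip(logits_cond, logits_uncond)]

def guidance_free_logits(model, tokens, cond):
    """A guidance-free (e.g. CCA-finetuned) model: one pass per step."""
    return model(tokens, cond)
```

At `scale=1.0`, CFG reduces to plain conditional sampling; larger scales trade diversity for fidelity, the same knob CCA exposes through its training parameters.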

[Figure: compare.JPG]

Features of CCA:

  • High performance. CCA vastly improves the guidance-free performance of all tested AR visual models, largely removing the need for CFG. (Figure below)
  • Convenient to deploy. CCA requires no additional datasets beyond the one used for pretraining.
  • Fast to train. CCA requires finetuning pretrained models for only one epoch to achieve ideal performance (~1% of the pretraining computation).
  • Consistency with LLM alignment. CCA is theoretically grounded in existing LLM alignment methods and bridges the gap between visual-targeted guidance and language-targeted alignment, offering a unified framework for mixed-modal modeling.

Github: https://github.com/thu-ml/CCA/tree/main
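Since CCA is motivated by LLM alignment methods such as DPO/NPO, a contrastive objective of that family can be sketched as below. This is a plausible rendering under that assumption, not the paper's exact formula: `logp_pos`/`logp_neg` are hypothetical log-likelihoods of an image under its matched and a mismatched condition, `*_ref` are the same quantities from a frozen reference (pretrained) model, and `beta`, `lam` are illustrative hyperparameters:

```python
import math

def contrastive_alignment_loss(logp_pos, logp_pos_ref,
                               logp_neg, logp_neg_ref,
                               beta=0.1, lam=1.0):
    """DPO/NPO-style contrastive sketch: raise the policy's likelihood of an
    image under its matched condition relative to the frozen reference, and
    lower it under a mismatched (negative) condition."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    pos = -math.log(sigmoid(beta * (logp_pos - logp_pos_ref)))   # pull matched pair up
    neg = -math.log(sigmoid(-beta * (logp_neg - logp_neg_ref)))  # push mismatched pair down
    return pos + lam * neg
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss sits at `2*log(2)` (with `lam=1`); it decreases as the model separates matched from mismatched condition-image pairs, which requires no data beyond the pretraining set.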

[Figure: result.JPG]
[Figure: overview.JPG]


