arxiv:2408.08459

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Published on Aug 15, 2024
· Submitted by akhaliq on Aug 19, 2024
#2 Paper of the day
Authors: Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov

Abstract

Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization -- representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
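
To make the byte-level representation concrete, here is a minimal sketch of the idea (illustrative only, not the authors' released code; the helper names and the `quality` setting are assumptions, not the paper's exact configuration): a stock JPEG encoder turns an image into file bytes, each byte becomes a token ID, and a generated byte sequence decodes back with a stock JPEG decoder.

```python
# Illustrative sketch of the canonical-codec idea: serialize an image with a
# standard JPEG encoder and treat the file bytes (0-255) as the LM's token
# sequence. Settings like `quality` are assumptions for illustration.
import io

from PIL import Image


def image_to_tokens(img: Image.Image, quality: int = 25) -> list[int]:
    """Encode an image as JPEG; each file byte becomes one token ID in [0, 255]."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return list(buf.getvalue())


def tokens_to_image(tokens: list[int]) -> Image.Image:
    """Decode a (complete) generated byte sequence with an off-the-shelf JPEG decoder."""
    return Image.open(io.BytesIO(bytes(tokens)))


# A plain next-token-prediction LM over this 256-symbol vocabulary (plus
# BOS/EOS) needs no VQ codebook or vision-specific tokenizer.
tokens = image_to_tokens(Image.new("RGB", (64, 64), "red"))
assert tokens[:2] == [0xFF, 0xD8]  # every JPEG file begins with the SOI marker
```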

Community



Cool! Would you like to make it open-source?

Currently you use a fully autoregressive model. Would you like to try parallel decoding, as in MAGVIT or MaskGIT?

Paper author

Unfortunately, we cannot release the exact training code/model checkpoints at this moment, but if you search online there might be some (partial) reimplementations. The parallel decoding you mention (or discrete diffusion-style methods) could be possible, but I'd say it's challenging in our case: those methods usually favor fixed-length/fixed-dimension inputs, whereas JPEG produces variable-length representations.
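
For intuition on the variable-length point above: two images with identical pixel dimensions can encode to very different JPEG byte counts, so there is no fixed token grid for a MaskGIT-style parallel decoder to fill in. A quick hypothetical check (helper name and settings are illustrative):

```python
# JPEG is variable-length: same pixel dimensions, very different byte counts.
import io

from PIL import Image


def jpeg_len(img: Image.Image, quality: int = 25) -> int:
    """Return the size in bytes of the image's JPEG encoding."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return len(buf.getvalue())


flat = Image.new("RGB", (256, 256), (128, 128, 128))  # uniform gray
noisy = Image.effect_noise((256, 256), sigma=64)      # high-frequency noise

# The flat image compresses to a tiny file; the noisy one is many times
# larger, even though both are 256x256 pixels.
print(jpeg_len(flat), jpeg_len(noisy))
```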


Models citing this paper: 1

Datasets citing this paper: 0


Spaces citing this paper: 0


Collections including this paper: 9