arxiv:2409.18869

Emu3: Next-Token Prediction is All You Need

Published on Sep 27 · Submitted by akhaliq on Sep 30
#1 Paper of the day
Abstract

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
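For intuition, here is a minimal, hedged sketch of the training setup the abstract describes: text and vision tokens mapped into one shared discrete vocabulary and fed to a single decoder-only transformer trained with plain next-token prediction. This is not the authors' code; the vocabulary sizes, model dimensions, and sequence layout below are illustrative assumptions.

```python
# Minimal sketch (not the Emu3 implementation): next-token prediction over a
# shared vocabulary that mixes text and vision token ids in one flat sequence.
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000                 # assumed text tokenizer size
VISION_VOCAB = 32_768               # assumed vision codebook size
VOCAB = TEXT_VOCAB + VISION_VOCAB   # one shared vocabulary for all modalities

class TinyDecoder(nn.Module):
    """A small decoder-only transformer trained with next-token prediction."""
    def __init__(self, vocab=VOCAB, dim=256, layers=2, heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        b, t = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        return self.head(self.blocks(x, mask=mask))

# Toy batch: text tokens followed by vision tokens in a single flat sequence.
text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))
vision_ids = torch.randint(TEXT_VOCAB, VOCAB, (2, 48))   # offset into shared vocab
seq = torch.cat([text_ids, vision_ids], dim=1)

model = TinyDecoder()
logits = model(seq[:, :-1])
# Standard cross-entropy on the shifted sequence: the same loss regardless of
# whether the target token happens to be a text or a vision token.
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```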

Community

Paper submitter
https://emu.baai.ac.cn/

Clicking "View PDF" gives: No document for '2409.18869'. Paper here


Despite the discussion around VideoPoet, this doesn't seem significantly different from the architecture presented there. As I understand it, the main differences highlighted by the authors here are:

  1. Emu3 does not perform a second super-resolution step
  2. Emu3 does not use a pre-trained text encoder

However, these differences seem fairly superficial. It might be worthwhile to discuss, for example, the choice of MAGVIT-v2 vs. SBER, as the choice of image tokenizer seems to be the real difference between the two works.
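For readers unfamiliar with discrete image tokenizers, here is a rough, self-contained sketch of the VQ-style codebook lookup such tokenizers are built on. This is not MAGVIT-v2 or SBER's tokenizer; the codebook size and latent shapes are made up for illustration.

```python
# Hedged sketch of the core idea behind a discrete image tokenizer: quantize
# encoder features by nearest-neighbour lookup into a learned codebook.
# Shapes and codebook size are illustrative assumptions only.
import torch

codebook = torch.randn(8192, 64)        # assumed: 8192 codes of dimension 64
features = torch.randn(1, 16, 16, 64)   # assumed encoder output: 16x16 latent grid

# Quantize: each latent vector is replaced by the index of its nearest code.
flat = features.reshape(-1, 64)
dists = torch.cdist(flat, codebook)     # (256, 8192) pairwise distances
token_ids = dists.argmin(dim=1).reshape(1, 16, 16)

# These integer ids are what an autoregressive transformer then predicts,
# exactly like text token ids.
print(token_ids.shape, token_ids.min().item(), token_ids.max().item())
```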


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2409.18869 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2409.18869 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2409.18869 in a Space README.md to link it from this page.

Collections including this paper 7