arxiv:2407.06135

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

Published on Jul 8

Submitted by ethanchern on Jul 9


Authors: Ethan Chern, Jiadi Su, Pengfei Liu

Abstract

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.
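The abstract contrasts Anole's "native" design with adapter-based LMMs and with models that hand image generation to a separate diffusion model. As a rough illustration of what native, autoregressive interleaved generation implies, the sketch below flattens text tokens and image tokens into one sequence that a single decoder-only model could predict token by token, then splits a generated sequence back into text and image segments. It assumes images have already been quantized into discrete codebook indices. The sentinel IDs, the vocabulary offset, the codes-per-image value, and the helper names `interleave` and `split` are all hypothetical and do not describe Anole's or Chameleon's actual tokenizer.

```python
from typing import List, Tuple

# All IDs below are hypothetical; they are NOT Anole's or Chameleon's vocabulary.
BOI = 1                  # begin-of-image sentinel token
EOI = 2                  # end-of-image sentinel token
IMAGE_ID_OFFSET = 1000   # image codes occupy their own slice of the shared vocabulary
CODES_PER_IMAGE = 4      # toy value; a real image tokenizer emits far more codes

Segment = Tuple[str, List[int]]  # ("text", token ids) or ("image", codebook indices)


def interleave(segments: List[Segment]) -> List[int]:
    """Flatten text and image segments into one token sequence."""
    seq: List[int] = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)
        elif kind == "image":
            if len(ids) != CODES_PER_IMAGE:
                raise ValueError("each image must have exactly CODES_PER_IMAGE codes")
            seq.append(BOI)
            # Shift codebook indices into the image range of the shared vocabulary.
            seq.extend(IMAGE_ID_OFFSET + c for c in ids)
            seq.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return seq


def split(seq: List[int]) -> List[Segment]:
    """Recover text and image segments from a (generated) token sequence."""
    segments: List[Segment] = []
    text: List[int] = []
    i = 0
    while i < len(seq):
        if seq[i] == BOI:
            if text:  # close the pending text run before the image starts
                segments.append(("text", text))
                text = []
            end = seq.index(EOI, i + 1)  # raises ValueError if the image is unterminated
            # Undo the vocabulary shift to get codebook indices back.
            segments.append(("image", [t - IMAGE_ID_OFFSET for t in seq[i + 1:end]]))
            i = end + 1
        else:
            text.append(seq[i])
            i += 1
    if text:
        segments.append(("text", text))
    return segments


if __name__ == "__main__":
    segments = [("text", [10, 11]), ("image", [0, 1, 2, 3]), ("text", [12])]
    seq = interleave(segments)
    print(seq)  # [10, 11, 1, 1000, 1001, 1002, 1003, 2, 12]
    assert split(seq) == segments
```

Because both modalities share one token stream, a single next-token objective covers text and images alike. That is the property the abstract calls "native", as opposed to routing image generation through an external diffusion model.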


Community

ethanchern (paper author, paper submitter), Jul 9, edited Jul 9

Abstract

Homepage: https://gair-nlp.github.io/anole
Code: https://github.com/GAIR-NLP/anole
Hugging Face Model: https://huggingface.co/GAIR/Anole-7b-v0.1

nielsr, Jul 12

Hi @ethanchern, congrats on this work! Some remarks:

Would you be able to link the HF model to this paper page? See here for how to do that: https://huggingface.co/docs/hub/en/model-cards#linking-a-paper
Downloads are currently not being counted for your model because the model files sit in a "model" folder rather than at the root of the repo. See here for how to make download tracking work: https://huggingface.co/docs/hub/models-download-stats

Let me know if you need any help!