Abstract
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
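To make the connector change concrete, here is a minimal PyTorch sketch of the kind of two-layer MLP projector the abstract describes, mapping vision-encoder patch features into the language model's embedding space. The class name `MLPProjector` and the dimensions (1024-d CLIP-ViT-L patch features, 5120-d embeddings for a 13B LLaMA-style decoder) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that maps vision features to the LLM embedding space.

    Dimensions are illustrative: CLIP-ViT-L/14 at 336px yields 1024-d patch
    features; a 13B LLaMA-style decoder uses 5120-d token embeddings.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns visual tokens ready to be concatenated with text embeddings
        return self.proj(patch_features)


if __name__ == "__main__":
    # 336px / 14px patches -> 24x24 = 576 patch features per image
    projector = MLPProjector()
    vision_feats = torch.randn(2, 576, 1024)
    visual_tokens = projector(vision_feats)
    print(visual_tokens.shape)  # torch.Size([2, 576, 5120])
```

Replacing a single linear projection with this small MLP is one of the two simple modifications the note highlights; the other is purely a data change (adding academic-task-oriented VQA data with response formatting prompts), requiring no further architectural changes.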
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models (2023)
- An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models (2023)
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023)
- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts (2023)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants (2023)
Check out our LLaVA-1.6 blog post as well!
LLaVA-1.6: Improved reasoning, OCR, and world knowledge
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Demo: https://llava.hliu.cc/
Unlocking the Power of Simple Modifications in Multimodal Learning