Model Zoo

To use LLaVA-1.5 checkpoints, your llava package version must be newer than 1.1.0. See the instructions on how to upgrade.
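A quick way to check the version requirement above is sketched below. It assumes the package is installed under the distribution name `llava` (an assumption, adjust it if your install uses another name) and that versions are plain dotted numbers. The helper names are illustrative and not part of LLaVA.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "1.1.0"  # LLaVA-1.5 checkpoints need a version newer than this


def parse_version(text):
    """Turn '1.1.3' into (1, 1, 3); missing parts are padded with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def is_newer_than_minimum(installed, minimum=MINIMUM):
    """True only if `installed` is strictly newer than `minimum`."""
    return parse_version(installed) > parse_version(minimum)


def installed_llava_version(dist_name="llava"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_llava_version()
    if found is None:
        print("llava is not installed")
    elif is_newer_than_minimum(found):
        print(f"llava {found} can load LLaVA-1.5 checkpoints")
    else:
        print(f"llava {found} is too old; upgrade to a version newer than {MINIMUM}")
```

The comparison is strict because the requirement is "newer than 1.1.0", so 1.1.0 itself is rejected.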

If you are interested in including any other details in Model Zoo, please open an issue :)

The model weights below are merged weights, so you do not need to apply a delta. The usage of LLaVA checkpoints should comply with the base LLM's model license: LLaMA.

LLaVA-v1.5

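| Version | Size | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | full_ft-1e | liuhaotian/llava-v1.5-7b | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 |
| LLaVA-1.5 | 13B | full_ft-1e | liuhaotian/llava-v1.5-13b | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 |
| LLaVA-1.5 | 7B | lora-1e | coming soon | | | | | | | | | | | | |
| LLaVA-1.5 | 13B | lora-1e | coming soon | | | | | | | | | | | | |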
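Because these checkpoints are merged weights, they can be loaded without a separate base model. The sketch below assumes the `load_pretrained_model` and `get_model_name_from_path` helpers from the LLaVA code base, called the way the repository's README shows. Check them against your installed version. The `CHECKPOINTS` mapping and the `load_llava_v15` wrapper are illustrative names, not part of LLaVA.

```python
# Checkpoint names copied from the LLaVA-v1.5 table.
CHECKPOINTS = {
    "7B": "liuhaotian/llava-v1.5-7b",
    "13B": "liuhaotian/llava-v1.5-13b",
}


def load_llava_v15(size="7B"):
    """Load a merged LLaVA-1.5 checkpoint (needs llava installed and enough memory)."""
    # Imported lazily so this module can be imported without llava installed.
    from llava.mm_utils import get_model_name_from_path
    from llava.model.builder import load_pretrained_model

    model_path = CHECKPOINTS[size]
    # model_base=None: the weights are already merged, so no base model is needed.
    return load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
    )
```

A call such as `tokenizer, model, image_processor, context_len = load_llava_v15("7B")` would download the checkpoint from Hugging Face on first use.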


LLaVA-1.5 achieves SoTA performance across 11 benchmarks.

LLaVA-v1

Note: We recommend using the most capable LLaVA-v1.5 series above for the best performance.

| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Finetuning Data | Finetuning schedule | LLaVA-Bench-Conv | LLaVA-Bench-Detail | LLaVA-Bench-Complex | LLaVA-Bench-Overall | Download |
|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-13B-v1.3 | CLIP-L-336px | LCS-558K | 1e | LLaVA-Instruct-80K | proj-1e, lora-1e | 64.3 | 55.9 | 81.7 | 70.1 | LoRA, LoRA-Merged |
| LLaMA-2-13B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | full_ft-1e | 56.7 | 58.6 | 80.0 | 67.9 | ckpt |
| LLaMA-2-7B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | lora-1e | 51.2 | 58.9 | 71.6 | 62.8 | LoRA |

Projector weights

The model weights below are projector weights that we have pretrained. You can use them for visual instruction tuning. We will add more projector weights to the model zoo very soon.

NOTE: These projector weights are only compatible with llava>=1.0.0. Please check out the latest code base if your local code version is below v1.0.0.

NOTE: When you use our pretrained projector for visual instruction tuning, it is very important to use the same base LLM and vision encoder that we used to pretrain the projector. Otherwise, the performance will be very poor.

When using these projector weights to instruction-tune your LMM, make sure these options are set as follows:

    --mm_use_im_start_end False
    --mm_use_im_patch_token False

| Base LLM | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|---|
| Vicuna-13B-v1.5 | CLIP-L-336px | MLP-2x | LCS-558K | 1e | projector |
| Vicuna-7B-v1.5 | CLIP-L-336px | MLP-2x | LCS-558K | 1e | projector |
| LLaMA-2-13B-Chat | CLIP-L-336px | Linear | LCS-558K | 1e | projector |
| LLaMA-2-7B-Chat | CLIP-L-336px | Linear | LCS-558K | 1e | projector |
| LLaMA-2-13B-Chat | CLIP-L | Linear | LCS-558K | 1e | projector |
| LLaMA-2-7B-Chat | CLIP-L | Linear | LCS-558K | 1e | projector |
| Vicuna-13B-v1.3 | CLIP-L-336px | Linear | LCS-558K | 1e | projector |
| Vicuna-7B-v1.3 | CLIP-L-336px | Linear | LCS-558K | 1e | projector |
| Vicuna-13B-v1.3 | CLIP-L | Linear | LCS-558K | 1e | projector |
| Vicuna-7B-v1.3 | CLIP-L | Linear | LCS-558K | 1e | projector |
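The compatibility rule above (same base LLM and same vision encoder as the pretrained projector) can be checked before a training run is launched. The sketch below copies the table rows into a lookup and returns the two required options as a list of command-line tokens. The names `PROJECTORS`, `REQUIRED_FLAGS` and `projector_flags` are illustrative and not part of LLaVA.

```python
# Rows copied from the projector table: (base LLM, vision encoder) -> projection type.
PROJECTORS = {
    ("Vicuna-13B-v1.5", "CLIP-L-336px"): "MLP-2x",
    ("Vicuna-7B-v1.5", "CLIP-L-336px"): "MLP-2x",
    ("LLaMA-2-13B-Chat", "CLIP-L-336px"): "Linear",
    ("LLaMA-2-7B-Chat", "CLIP-L-336px"): "Linear",
    ("LLaMA-2-13B-Chat", "CLIP-L"): "Linear",
    ("LLaMA-2-7B-Chat", "CLIP-L"): "Linear",
    ("Vicuna-13B-v1.3", "CLIP-L-336px"): "Linear",
    ("Vicuna-7B-v1.3", "CLIP-L-336px"): "Linear",
    ("Vicuna-13B-v1.3", "CLIP-L"): "Linear",
    ("Vicuna-7B-v1.3", "CLIP-L"): "Linear",
}

# The two options that must be set when instruction tuning with these projectors.
REQUIRED_FLAGS = ["--mm_use_im_start_end", "False", "--mm_use_im_patch_token", "False"]


def projector_flags(base_llm, vision_encoder):
    """Return the required flags, or raise if no pretrained projector matches."""
    if (base_llm, vision_encoder) not in PROJECTORS:
        raise ValueError(f"no pretrained projector for {base_llm} + {vision_encoder}")
    return list(REQUIRED_FLAGS)
```

For example, `projector_flags("Vicuna-13B-v1.5", "CLIP-L")` raises, because the table lists Vicuna-13B-v1.5 only with CLIP-L-336px.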

Science QA Checkpoints

| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|---|
| Vicuna-13B-v1.3 | CLIP-L | LCS-558K | 1e | ScienceQA | full_ft-12e | ckpt |

Legacy Models (merged weights)

The model weights below are merged weights, so you do not need to apply a delta. The usage of LLaVA checkpoints should comply with the base LLM's model license.

| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|---|
| MPT-7B-Chat | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | full_ft-1e | preview |

Legacy Models (delta weights)

The model weights below are delta weights. The usage of LLaVA checkpoints should comply with the base LLM's model license: LLaMA.

You can add our delta to the original LLaMA weights to obtain the LLaVA weights.
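The idea of adding a delta to base weights can be illustrated with a toy example on plain Python lists. This is only a sketch of the arithmetic, not the `llava.model.apply_delta` script, which works on full model tensors. How weights missing from the base model are handled here (copied from the delta unchanged) is an assumption made for illustration.

```python
def apply_delta(base, delta):
    """Return target weights where target[name] = base[name] + delta[name]."""
    target = {}
    for name, delta_values in delta.items():
        if name in base:
            base_values = base[name]
            if len(base_values) != len(delta_values):
                raise ValueError(f"shape mismatch for {name}")
            target[name] = [b + d for b, d in zip(base_values, delta_values)]
        else:
            # Assumption: weights absent from the base model are stored in full.
            target[name] = list(delta_values)
    return target


# Toy weights; the names and values are made up for the example.
base = {"layer.weight": [0.5, -1.0]}
delta = {"layer.weight": [0.25, 0.5], "projector.weight": [2.0]}
print(apply_delta(base, delta))
```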

Instructions:

  1. Get the original LLaMA weights in the Hugging Face format by following the instructions here.
  2. Use the following script to get the LLaVA weights by applying our delta. It automatically downloads the delta weights from our Hugging Face account. The script below uses the delta weights of liuhaotian/LLaVA-7b-delta-v0 as an example. To adapt it to other delta weights, change the --delta argument (and --base/--target accordingly).

         python3 -m llava.model.apply_delta \
             --base /path/to/llama-7b \
             --target /output/path/to/LLaVA-7B-v0 \
             --delta liuhaotian/LLaVA-7b-delta-v0

| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|---|
| Vicuna-13B-v1.1 | CLIP-L | CC-595K | 1e | LLaVA-Instruct-158K | full_ft-3e | delta-weights |
| Vicuna-7B-v1.1 | CLIP-L | LCS-558K | 1e | LLaVA-Instruct-80K | full_ft-1e | delta-weights |
| Vicuna-13B-v0 | CLIP-L | CC-595K | 1e | LLaVA-Instruct-158K | full_ft-3e | delta-weights |
| Vicuna-13B-v0 | CLIP-L | CC-595K | 1e | ScienceQA | full_ft-12e | delta-weights |
| Vicuna-7B-v0 | CLIP-L | CC-595K | 1e | LLaVA-Instruct-158K | full_ft-3e | delta-weights |
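When the `apply_delta` command above has to be run for several delta weights, it can help to build it from Python. The sketch below only assembles the same three arguments shown in the command. The function name `apply_delta_command` is illustrative, and the paths are the placeholders from the example.

```python
import shlex


def apply_delta_command(base, target, delta):
    """Build the argument list for the apply_delta module shown above."""
    return [
        "python3", "-m", "llava.model.apply_delta",
        "--base", base,      # original LLaMA weights in Hugging Face format
        "--target", target,  # where the resulting LLaVA weights are written
        "--delta", delta,    # delta weights to apply
    ]


cmd = apply_delta_command(
    base="/path/to/llama-7b",
    target="/output/path/to/LLaVA-7B-v0",
    delta="liuhaotian/LLaVA-7b-delta-v0",
)
print(shlex.join(cmd))
# To actually run it (requires llava and the base weights):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.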

Legacy Projector weights

The following projector weights are deprecated, and support for them may be removed in the future. They do not support zero-shot inference. Please use the projector weights in the table above if possible.

NOTE: When you use our pretrained projector for visual instruction tuning, it is very important to use the same base LLM and vision encoder that we used to pretrain the projector. Otherwise, the performance will be very poor.

When using these projector weights to instruction-tune your LMM, make sure these options are set as follows:

    --mm_use_im_start_end True
    --mm_use_im_patch_token False

| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|
| Vicuna-7B-v1.1 | CLIP-L | LCS-558K | 1e | projector |
| Vicuna-13B-v0 | CLIP-L | CC-595K | 1e | projector |
| Vicuna-7B-v0 | CLIP-L | CC-595K | 1e | projector |

When using the projector weights below to instruction-tune your LMM, make sure these options are set as follows:

    --mm_use_im_start_end False
    --mm_use_im_patch_token False

| Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|
| Vicuna-13B-v0 | CLIP-L | CC-595K | 1e | projector |