metadata

language:
  - en
license_name: gemma-terms
license_link: https://ai.google.dev/gemma/terms

LLaVA-Gemma Model Card

This model card corresponds to the 2B version of the model with the CLIP-based vision encoder.

Overview

llava-gemma-2b is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework with the 2-billion parameter google/gemma-2b-it model as language backbone.

Uses

The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.

Bias, Risks, and Limitations

This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.

How to Get Started with the Model

Currently using llava-gemma requires a modified preprocessor.

For example usage, see usage.py or the following code block:

import requests
from PIL import Image
from transformers import (
  LlavaForConditionalGeneration,
  AutoTokenizer,
  CLIPImageProcessor
)
from processing_llavagemma import LlavaGemmaProcessor # This is in this repo

checkpoint = "Intel/llava-gemma-2b"

# Load model
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
)

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "What's the content of the image?<image>"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
      
# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

Training Details

The llava-gemma-2b model was trained on 8 Gaudi 2 accelerators.

Training Data

The model was trained using the LLaVA-v1.5 data mixture.

This is listed as follows:

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
450K academic-task-oriented VQA data mixture.
40K ShareGPT data.

Evaluation

LM Backbone	Vision Model	Pretrained Connector	GQA	MME cognition	MME perception	MM-Vet	POPE accuracy	POPE F1	VQAv2	TextVQA	ScienceQA Image	MMVP
gemma-2b-it	CLIP	Yes	0.531	236.071	1130.492	17.706	0.850	0.839	70.65	28.06	0.564	0.287
gemma-2b-it	CLIP	No	0.481	247.857	934.611	13.119	0.784	0.762	61.74		0.549	0.180
gemma-7b-it	CLIP	Yes	0.472	253.571	894.910	18.165	0.848	0.829	68.7		0.625	0.327
gemma-7b-it	CLIP	No	0.472	278.214	857.274	19.083	0.782	0.734	65.09		0.636	0.240
gemma-2b-it	DinoV2	Yes	0.587	307.143	1132.970	19.128	0.853	0.838	71.37	12.53	0.555	0.227
gemma-2b-it	DinoV2	No	0.501	308.929	959.351	14.541	0.793	0.772	61.65	11.1	0.568	0.180