matthewlyleolson committed 3371fdb (parent: a678b33): Update README.md

README.md CHANGED
The previous README.md contained only the metadata header, with `license_name: intel-research-use`; the updated file renames this to `intel-research-use-license` and adds the full model card below.
---
license: other
license_name: intel-research-use-license
license_link: LICENSE
---

# LLaVA-Llama3 Model Card

_This model card corresponds to the instruction-tuned 8B version of the model with the CLIP-based vision encoder._

## Overview

`llava-llama-3-8b` is a large multimodal model (LMM) trained with the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744), using the 8-billion-parameter [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model as its language backbone.

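For orientation, the sketch below inspects the vision/language composition through the Hugging Face config API. It is a minimal illustration, not part of the official usage flow: it assumes the checkpoint exposes a standard HF LLaVA-style config, and it reuses the `Intel/llava-llama-3-8b-old` repo id from the Usage section below.

```python
# Minimal sketch (assumption: the checkpoint ships a standard HF LLaVA config).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Intel/llava-llama-3-8b-old")
print(config.model_type)                # expected: "llava"
print(config.vision_config.model_type)  # the CLIP-based vision encoder
print(config.text_config.model_type)    # expected: "llama" (the Meta-Llama-3 backbone)
```
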
## Uses

The model has been finetuned for multimodal benchmark evaluations, but it can also be used as a multimodal chatbot.

## Bias, Risks, and Limitations

This model has not been assessed for harm or biases, and it should not be used for sensitive applications where it may cause harm.

## Training Details

The `llava-llama-3-8b` model was trained on a 4-node cluster with a total of 32 Gaudi 2 accelerators.

### Training Data

The model was trained on the LLaVA-v1.5 data mixture, which consists of:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following examples.
- 450K examples from an academic-task-oriented VQA data mixture.
- 40K ShareGPT examples.

## Evaluation

| Benchmark | Score |
|-----------|-------|
| ScienceQA | 72.9797 |
| MMVet | 31.9725 |
| llavaw (LLaVA-Bench in-the-Wild) | 56.9 / 61.9 / 73.6 / 65.7 |
| POPE | Acc 87.33, F1 86.5 |
| GQA | 60.6138 |
| MMVP | 36 |

## License

The weights are released under the Intel Research Use License Agreement (see the LICENSE file).
All usage code is licensed under Apache 2.0.

## Usage

Please note that we only provide the trained weight difference (delta weights) and do not provide a copy of the base [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model. Any use of these weights requires a separate download of the base model.

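To make the merging step in the script below easier to follow, here is a toy illustration of the delta-weights idea using made-up tensors (purely illustrative; these names are not part of the released checkpoint):

```python
import torch

# Toy illustration of the delta-weights scheme (hypothetical tensors, not real model weights).
base = torch.randn(4, 4)                     # a base Llama-3 weight matrix
finetuned = base + 0.01 * torch.randn(4, 4)  # the corresponding finetuned weight
delta = finetuned - base                     # what this repository ships
restored = delta + base                      # what the script below reconstructs
assert torch.allclose(restored, finetuned)
```
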
```python
# Copyright 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import requests
import torch
import transformers
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining


def expand2square(pil_img, background_color):
    """Pad a PIL image to a square with the given background color."""
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result


def add_model_a_to_b(model_a, model_b):
    """Add model_a's weights into model_b in place (base + delta)."""
    state_dict_a = model_a.state_dict()
    state_dict_b = model_b.state_dict()

    # Ensure the key sets match before adding
    if set(state_dict_a.keys()) != set(state_dict_b.keys()):
        raise ValueError("Model state dicts do not have the same keys.")

    for key in state_dict_a:
        if state_dict_a[key].shape != state_dict_b[key].shape:
            raise ValueError(f"Shape mismatch for key '{key}': {state_dict_a[key].shape} vs {state_dict_b[key].shape}")
        # Add model_a's weights to model_b for the matching key
        state_dict_b[key] = state_dict_b[key] + state_dict_a[key]

    # Update model_b with the merged weights
    model_b.load_state_dict(state_dict_b)


output_checkpoint = ""  # set this to a local path if you don't want to merge every time
hf_checkpoint = "Intel/llava-llama-3-8b-old"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(hf_checkpoint)
model = AutoModelForPreTraining.from_pretrained(hf_checkpoint)

# The repo ships only delta weights, so the base Llama-3 weights still need to be
# added to the language model; a zeroed final embedding row indicates this has not happened yet.
if model.language_model.model.embed_tokens.weight[-1].sum() == 0:
    print("adding llama3 weights")
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="cpu",
    )
    llama3 = pipeline.model
    add_model_a_to_b(llama3, model.language_model)
    if output_checkpoint:
        print("saving weights, so no adding is needed again")
        model.save_pretrained(output_checkpoint)

model.to(device)

prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True,
)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The original LLaVA pads with the image mean, while HF LLaVA pads with zeros,
# so pad to a square with the processor's image mean before preprocessing.
image = expand2square(image, tuple(int(x * 255) for x in processor.image_processor.image_mean))
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
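
If you set `output_checkpoint` above, later runs can load the merged weights directly and skip the merging branch entirely. A minimal sketch, assuming a hypothetical local path `./llava-llama-3-8b-merged` was used as `output_checkpoint`:

```python
# Minimal follow-up sketch: reload a previously merged checkpoint.
# "./llava-llama-3-8b-merged" is a hypothetical local path (whatever you passed as output_checkpoint).
import torch
from transformers import AutoProcessor, AutoModelForPreTraining

merged_path = "./llava-llama-3-8b-merged"
processor = AutoProcessor.from_pretrained("Intel/llava-llama-3-8b-old")  # processor config still comes from the hub repo
model = AutoModelForPreTraining.from_pretrained(merged_path, torch_dtype=torch.bfloat16)
model.to("cuda" if torch.cuda.is_available() else "cpu")
```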