---
license: other
license_name: intel-research-use-license
license_link: LICENSE
tags:
- intel
- gaudi
- LLM
results:
- task:
    type: Large Language Model
    name: Large Language Model
  metrics:
  - type: GQA
    name: GQA
    value: 60.6138
  - type: MMVP
    name: MMVP
    value: 36
  - type: Pope Acc
    name: Pope Acc
    value: 87.33
  - type: Pope F1
    name: Pope F1
    value: 86.5
  - type: MMVet
    name: MMVet
    value: 31.9725
  - type: ScienceQA
    name: ScienceQA
    value: 72.9797
  - type: llavaw (1)
    name: llavaw
    value: 56.9
  - type: llavaw (2)
    name: llavaw
    value: 61.9
  - type: llavaw (3)
    name: llavaw
    value: 73.6
  - type: llavaw (4)
    name: llavaw
    value: 65.7
library_name: transformers
pipeline_tag: image-text-to-text
---

## Model Details: LLaVA-llama-3-8B

`llava-llama-3-8b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744), with the 8-billion-parameter [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model as the language backbone and a CLIP-based vision encoder.

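For a quick look at how the model is composed, the checkpoint's configuration exposes the vision and language sub-configs. A minimal sketch, assuming a `transformers` version with LLaVA support (the attribute names follow the standard `LlavaConfig` layout and are not specific to this repository):

```python
from transformers import AutoConfig

# Inspect how the multimodal model is composed (downloads only the config, no weights).
config = AutoConfig.from_pretrained("Intel/llava-llama-3-8b")
print(config.model_type)                # overall architecture family
print(config.vision_config.model_type)  # CLIP-based vision encoder
print(config.text_config.model_type)    # Llama language backbone
```
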
| Model Details | Description |
| ----------- | ----------- |
| Authors | Intel: [Musashi Hinck*](https://huggingface.co/musashihinck), [Matthew L. Olson*](https://huggingface.co/matthewlyleolson), [Vasudev Lal](https://huggingface.co/vasudevlal) |
| Date | May 2024 |
| Version | 1 |
| Type | Large multimodal model (LMM) |
| Paper or Other Resources | [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/abs/2310.03744) |
| License | [Intel Research Use License](https://huggingface.co/Intel/llava-llama-3-8b/blob/main/LICENSE). All usage code is licensed Apache 2.0. |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/llava-llama-3-8b/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ) |

This model card was created by [Eduardo Alvarez](https://huggingface.co/eduardo-alvarez) and the authors listed above.

## Intended Use
| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | The model has been finetuned for multimodal benchmark evaluations, but it can also be used as a multimodal chatbot. |
| Primary intended users | Anyone using or evaluating multimodal models. |
| Out-of-scope uses | This model is not intended for uses that require high levels of factuality; high-stakes situations; mental health or medical applications; generating misinformation or disinformation; impersonating others; facilitating or inciting harassment or violence; or any use that could lead to the violation of a human right under the UN Declaration of Human Rights. |

### How to use

Please note that we only provide the trained weight difference; we do not provide a copy of the base `meta-llama/Meta-Llama-3-8B-Instruct` model. Any use of these weights requires a separate download of the base model, whose weights are added back onto the released difference at load time, as shown in the script below.

```python
# Copyright 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining
import transformers


def expand2square(pil_img, background_color):
    # Pad the shorter side so the image becomes square, centered on the canvas
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result


def add_model_a_to_b(model_a, model_b):
    state_dict_a = model_a.state_dict()
    state_dict_b = model_b.state_dict()

    # Ensure keys match before addition
    if set(state_dict_a.keys()) != set(state_dict_b.keys()):
        raise ValueError("Model state dicts do not have the same keys.")

    for key in state_dict_a:
        if state_dict_a[key].shape != state_dict_b[key].shape:
            raise ValueError(f"Shape mismatch for key '{key}': {state_dict_a[key].shape} vs {state_dict_b[key].shape}")
        # Add model_a's weights to model_b's weights for the matching key
        state_dict_b[key] = state_dict_b[key] + state_dict_a[key]

    # Update model_b with the new weights
    model_b.load_state_dict(state_dict_b)


output_checkpoint = ""  # set if you don't want to merge every time
hf_checkpoint = "Intel/llava-llama-3-8b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(hf_checkpoint)
model = AutoModelForPreTraining.from_pretrained(hf_checkpoint)

# The released checkpoint stores only the weight difference: if the embedding
# weights are still zero, the base Llama 3 weights have not been added yet.
if model.language_model.model.embed_tokens.weight[-1].sum() == 0:
    print("adding llama3 weights")
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="cpu",
    )
    llama3 = pipeline.model
    add_model_a_to_b(llama3, model.language_model)
    if output_checkpoint:
        print("saving weights, so no adding is needed again")
        model.save_pretrained(output_checkpoint)

model.to(device)

prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Original LLaVA pads with the image mean; HF LLaVA pads with zeros
image = expand2square(image, tuple(int(x * 255) for x in processor.image_processor.image_mean))
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
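
If `output_checkpoint` is set above, the merged weights are written to disk once, so later runs can load them directly and skip the merge step. A minimal sketch (the local path is illustrative, not a published checkpoint):

```python
from transformers import AutoProcessor, AutoModelForPreTraining

# Load the previously merged checkpoint saved via model.save_pretrained(...)
processor = AutoProcessor.from_pretrained("Intel/llava-llama-3-8b")
model = AutoModelForPreTraining.from_pretrained("./llava-llama-3-8b-merged")
```
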
## Factors

| Factors | Description |
| ----------- | ----------- |
| Environment | Trained on a four-node cluster with a total of 32 Gaudi 2 accelerators. |
| Card Prompts | Model training and deployment on alternate hardware and software will change model performance. |

## Training Data

The model was trained using the LLaVA-v1.5 data mixture, which consists of the following (see the data-format sketch after this list):

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data.
- 40K ShareGPT data.

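For reference, LLaVA-v1.5-style instruction data is typically distributed as a JSON list of records, each pairing an optional image path with a multi-turn conversation. A minimal inspection sketch (the filename and field names follow the public LLaVA reference release; they are assumptions, not files shipped with this model):

```python
import json

# Hypothetical local copy of the LLaVA-v1.5 instruction mixture
with open("llava_v1_5_mix665k.json") as f:
    records = json.load(f)

print(len(records))                   # number of training records
sample = records[0]
print(sample.get("image"))            # associated image path, if any
for turn in sample["conversations"]:  # alternating "human" / "gpt" turns
    print(turn["from"], ":", turn["value"][:80])
```
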
## Ethical Considerations

Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.

| Ethical Considerations | Description |
| ----------- | ----------- |
| Data | The model was trained using the LLaVA-v1.5 data mixture as described above. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | This model has not been assessed for harm or biases, and it should not be used for sensitive applications where it may cause harm. |
| Use cases | - |

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. This model has not been assessed for harm or biases, and it should not be used for sensitive applications where it may cause harm.