---
license: other
license_name: tongyi-qianwen
license_link: LICENSE
tags:
- vision
- image-text-to-text
language:
- en
pipeline_tag: image-text-to-text
---
# LLaVa-Next Model Card
The LLaVA-NeXT model was proposed in [LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/) by Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, Chunyuan Li.

This LLaVA-NeXT series improves upon [LLaVa-1.6](https://llava-vl.github.io/blog/2024-01-30-llava-next/) by training with stronger language backbones, improving performance.

Disclaimer: The team releasing LLaVa-NeXT did not write a model card for this model, so this model card has been written by the Hugging Face team.
## Model description
LLaVa combines a pre-trained large language model with a pre-trained vision encoder for multimodal chatbot use cases. LLaVA-NeXT Qwen1.5-110B improves on LLaVA-1.6 by:
- More diverse and high quality data mixture
- Better and bigger language backbone

Base LLM: [Qwen/Qwen1.5-110B-Chat](https://huggingface.co/Qwen/Qwen1.5-110B-Chat)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62441d1d9fdefb55a0b7d12c/FPshq08TKYD0e-qwPLDVO.png)
## Intended uses & limitations
You can use the raw model for tasks like image captioning, visual question answering, and multimodal chatbot use cases. See the [model hub](https://huggingface.co/models?search=llava-hf) to look for other versions for a task that interests you.
### How to use
You can load and use the model as follows:
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests
model_id = "llava-hf/llava-next-110b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
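The processor also accepts lists of prompts and images, so several conversations can be generated in one batch. Below is a minimal, illustrative sketch that simply reuses the `prompt` and `image` variables from the snippet above; left padding is used because decoder-only generation pads on the left.

```python
# Illustrative batched generation, reusing `prompt` and `image` from the snippet above
processor.tokenizer.padding_side = "left"  # pad prompts on the left for generation

prompts = [prompt, prompt]   # in practice these would be different conversations
images = [image, image]
inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
for text in processor.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```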
### Model optimization
#### 4-bit quantization through `bitsandbytes` library
First make sure to install `bitsandbytes` (`pip install bitsandbytes`) and that you have access to a CUDA-compatible GPU device. Then simply change the snippet above as follows:
```diff
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
```
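In newer `transformers` releases the `load_in_4bit` shortcut is deprecated in favor of an explicit `BitsAndBytesConfig`. A minimal sketch of the equivalent call, assuming the same checkpoint as above:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# Explicit 4-bit quantization config, equivalent to the load_in_4bit shortcut above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-next-110b-hf",
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    device_map="auto",
)
```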
#### Use Flash-Attention 2 to further speed-up generation
First make sure to install `flash-attn`; refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then simply change the snippet above as follows:
```diff
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
```
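Note that `use_flash_attention_2` is deprecated in newer `transformers` releases in favor of the `attn_implementation` argument. A minimal sketch of the equivalent call under that assumption:

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Select the Flash Attention 2 backend via attn_implementation (newer API)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-next-110b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```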
### Training Data
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 500K academic-task-oriented VQA data mixture.
- 50K GPT-4V data mixture.
- 40K ShareGPT data.
### License Notices
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama-1/2 community license for LLaMA-2 and Vicuna-v1.5, Tongyi Qianwen LICENSE AGREEMENT and META LLAMA 3 COMMUNITY LICENSE AGREEMENT). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
### BibTeX entry and citation info
```bibtex
@misc{li2024llavanext-strong,
      title={LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild},
      url={https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/},
      author={Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan},
      month={May},
      year={2024}
}
```