---
language:
- en
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
---

## Usage

```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "nota-ai/phiva-4b-hf"

# The <image> placeholder marks where the image features are inserted into the prompt.
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the model in half precision and move it to the first GPU.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="eager",
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Download the example image and build the model inputs.
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)

# Greedy decoding; print only the newly generated tokens, not the prompt.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Terms of use

The vision-language model published in this repository was developed by combining several modules (e.g., a vision encoder and a language model). Commercial use of any modifications, additions, or newly trained parameters made to combine these modules is not allowed. Commercial use of the unmodified modules is allowed under their respective licenses. If you wish to use the individual modules commercially, refer to their original repositories and licenses listed below.

- Vision encoder: [Model](https://huggingface.co/openai/clip-vit-base-patch16), [License](https://github.com/openai/CLIP/blob/main/LICENSE)
- Language model: [Model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), [License](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/resolve/main/LICENSE)
- VLM framework: [GitHub](https://github.com/haotian-liu/LLaVA), [License](https://github.com/haotian-liu/LLaVA/blob/main/LICENSE)