---
language:
- en
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
---

## Usage
```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "nota-ai/phiva-4b-hf"

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the model in half precision and move it to the first GPU (device 0).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="eager",
).to(0)

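processor = AutoProcessor.from_pretrained(model_id)

# Download the example image and build the model inputs.
# Keyword arguments avoid depending on the positional order of text and images.
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)

# Greedy decoding; drop the prompt tokens so only the generated answer is printed.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))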
```
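
The prompt in the snippet above follows a single-turn `USER: <image>\n...\nASSISTANT:` layout. To ask a different question, a small helper along these lines may be convenient. This is only a sketch: `build_prompt` and `IMAGE_TOKEN` are hypothetical names, not part of `transformers` or this repository, and it assumes the template shown above is the one the model expects.

```python
# Hypothetical helper (an assumption, not part of transformers or this repo).
# It reproduces the single-turn prompt layout used in the usage snippet.
IMAGE_TOKEN = "<image>"


def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template with one image slot."""
    return f"USER: {IMAGE_TOKEN}\n{question.strip()}\nASSISTANT:"


prompt = build_prompt("What are these?")
print(repr(prompt))
```

The resulting string can be passed to the processor in place of the hard-coded `prompt`.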

## Terms of use
The vision-language model published in this repository was developed by combining several modules (e.g., vision encoder, language model). Commercial use of any modifications, additions, or newly trained parameters made to combine these modules is not allowed.
However, commercial use of the unmodified modules is allowed under their respective licenses. If you wish to use the individual modules commercially, refer to their original repositories and licenses listed below.


Vision encoder (license) link: [Model](https://huggingface.co/openai/clip-vit-base-patch16), [License](https://github.com/openai/CLIP/blob/main/LICENSE)

Language model (license) link: [Model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), [License](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/resolve/main/LICENSE)

VLM framework (license) link: [Github](https://github.com/haotian-liu/LLaVA), [License](https://github.com/haotian-liu/LLaVA/blob/main/LICENSE)