---
language:
- en
tags:
- Vision
- HelpingAI
license: mit
library_name: transformers
base_model: visheratin/MC-LLaVA-3b
widget:
- text: What animal is it?
  src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
- text: Where is it?
  src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
pipeline_tag: image-text-to-text
---

# HelpingAI-Vision

<a target="_blank" href="https://colab.research.google.com/drive/1t2OAMVSKsiqVgvuHq7rhyNv28b67u0D8">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Model details

The core idea behind HelpingAI-Vision is to generate one token embedding for each of N parts of an image, instead of producing N visual token embeddings for the whole image. This approach, built on HelpingAI-Lite with the LLaVA adapter, aims to improve scene understanding by capturing more detailed information.

For each crop of the image, the full SigLIP encoder generates an embedding of size [1, 1152]. All N embeddings are then passed through the LLaVA adapter, which produces token embeddings of size [N, 2560]. These tokens currently carry no explicit information about their position in the original image; positional information is planned for a later update.

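
The shape flow described above can be illustrated with a small, dependency-free sketch. It rests on assumptions that are not taken from the model code: `grid_crop_boxes` splits the image into a uniform grid (the real cropping strategy may differ), and `toy_adapter` is a single random linear map standing in for the trained LLaVA adapter. The demo uses tiny dimensions so it runs quickly; the widths stated above are kept as constants for reference.

```python
import random

SIGLIP_DIM = 1152  # per-crop embedding width stated above
TOKEN_DIM = 2560   # token embedding width stated above


def grid_crop_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) boxes for a uniform rows x cols grid.

    Assumption: a uniform grid; the model's real cropping strategy may differ.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def toy_adapter(crop_embeddings, weight):
    """Apply one linear map per crop: [N, in_dim] -> [N, out_dim].

    Hypothetical stand-in for the LLaVA adapter, which is a trained network.
    """
    return [
        [sum(w * x for w, x in zip(row, emb)) for row in weight]
        for emb in crop_embeddings
    ]


boxes = grid_crop_boxes(width=768, height=768, rows=2, cols=2)  # N = 4 crops

# Tiny dimensions keep the demo fast; the shapes described above are
# [N, SIGLIP_DIM] -> [N, TOKEN_DIM].
rng = random.Random(0)
in_dim, out_dim = 8, 16
crop_embeddings = [[rng.random() for _ in range(in_dim)] for _ in boxes]
weight = [[rng.random() for _ in range(in_dim)] for _ in range(out_dim)]
tokens = toy_adapter(crop_embeddings, weight)

print(len(boxes), boxes[0], boxes[-1])
print(len(tokens), len(tokens[0]))  # one token embedding per crop
```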
HelpingAI-Vision was fine-tuned from MC-LLaVA-3b.

The model uses the ChatML prompt format, which makes it suitable for chat-based applications. If you have questions or would like further details, feel free to ask.
```
<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
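
As a minimal sketch of filling in this template programmatically, the helper below assembles a ChatML prompt from a system and a user message. The function name `build_chatml_prompt` and its `with_image` flag are hypothetical and not part of the model's code; the `<image>` placeholder follows the prompt used in this card's usage example.

```python
def build_chatml_prompt(
    user_message,
    system_message="You are Vortex, a helpful AI assistant.",
    with_image=False,
):
    """Assemble a ChatML prompt that ends with an open assistant turn."""
    # "<image>" is the placeholder this card's usage example puts in the user turn.
    user_content = f"<image>\n{user_message}" if with_image else user_message
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt("Describe the image.", with_image=True)
print(prompt)
```

Leaving the assistant turn open at the end is what lets the model continue with its own reply.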
## How to use

**Install dependencies**

```bash
# In a notebook such as Colab, prefix the command with "!".
pip install -q open_clip_torch timm einops
```

**Download modeling files**

```python
from huggingface_hub import hf_hub_download

# Fetch the custom modeling code so it can be imported from the working directory.
for filename in [
    "configuration_llava.py",
    "configuration_phi.py",
    "modeling_llava.py",
    "modeling_phi.py",
    "processing_llava.py",
]:
    hf_hub_download(
        repo_id="OEvortex/HelpingAI-Vision",
        filename=filename,
        local_dir="./",
        force_download=True,
    )
```

**Create a model**

```python
import torch
from modeling_llava import LlavaForConditionalGeneration

# Load the weights in half precision, then move the model to the GPU.
model = LlavaForConditionalGeneration.from_pretrained(
    "OEvortex/HelpingAI-Vision", torch_dtype=torch.float16
)
model = model.to("cuda")
```

**Create processors**

```python
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
```

**Set image and text**

```python
from PIL import Image
import requests

image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
```

**Process inputs**

```python
from transformers import TextStreamer

# Tokenize the prompt and preprocess the image in one call.
with torch.inference_mode():
    inputs = processor(prompt, raw_image, model, return_tensors='pt')

# Move the text tensors to the same device as the model.
inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)

# Print tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer)
```

**Generate the data**

```python
# In a notebook, put the %%time cell magic on the first line of the cell to time the run.
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_p=0.9,
        temperature=1.2,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )

# Decode the full sequence, then strip the prompt and the end-of-turn token.
print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
```
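
Removing the prompt with `replace(prompt, "")` relies on the decoded text reproducing the prompt character for character, which may not always hold. A more tolerant alternative, sketched below, keeps only what follows the last assistant marker. The helper name `extract_reply` is hypothetical, and the `decoded` string is a made-up sample for illustration, not real model output.

```python
def extract_reply(decoded_text):
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = "<|im_start|>assistant"
    # Keep only what follows the final assistant marker (or everything if absent).
    reply = decoded_text.rsplit(marker, 1)[-1]
    # Cut at the end-of-turn token if the model emitted one.
    reply = reply.split("<|im_end|>", 1)[0]
    return reply.strip()


# Made-up sample string, used only to exercise the helper.
decoded = (
    "<|im_start|>user\n<image>\nDescribe the image.<|im_end|>\n"
    "<|im_start|>assistant\nA tiger resting in tall grass.<|im_end|>"
)
print(extract_reply(decoded))
```

With the objects from the steps above, the call would be `extract_reply(tokenizer.decode(output[0]))`.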