File size: 6,158 Bytes
5f0fa15 14cf27e 5f0fa15 14cf27e 5f0fa15 14cf27e 5f0fa15 14cf27e 5f0fa15 14cf27e 5f0fa15 14cf27e 5f0fa15 14cf27e 5f0fa15 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 |
---
license: apache-2.0
datasets:
- toshi456/llava-jp-instruct-108k
- turing-motors/LLaVA-Pretrain-JA
language:
- ja
pipeline_tag: image-to-text
---
# LLaVA-JP Model Card
## Model detail
**Model type:**
LLaVA-JP is a vision-language model that can converse about input images.<br>
This model is an LVLM model trained using [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) as the image encoder and [llm-jp/llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) as the text decoder. supports the input of 768 x 768 high resolution images by scaling_on_scales method.
**Training:**
This model was initially trained with the Vision Projector using LLaVA-Pretrain-JA.<br>
In the second phase, it was fine-tuned with LLaVA-JP-Instruct-108K.
resources for more information: https://github.com/tosiyuki/LLaVA-JP/tree/main
**Comparing VLMs**
|Model|JA-VG-VQA-500<br>(ROUGE-L)|JA-VLM-Bench-In-the-Wild<br>(ROUGE-L)|Heron-Bench(Detail)|Heron-Bench(Conv)|Heron-Bench(Complex)|Heron-Bench(Average)
|-|-|-|-|-|-|-|
|[Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm)|-|40.50|25.15|51.23|37.84|38.07|
|[EvoVLM-JP-v1-7B](https://huggingface.co/SakanaAI/EvoVLM-JP-v1-7B)|**19.70**|**51.25**|50.31|44.42|40.47|45.07|
|[Heron BLIP Japanese StableLM Base 7B llava-620k](https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v1-llava-620k)|14.51|33.26|49.09|41.51|45.72|45.44|
|[Heron GIT Japanese StableLM Base 7B](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1)|15.18|37.82|42.77|**54.20**|43.53|46.83|
|[llava-jp-1.3b-v1.1](https://huggingface.co/toshi456/llava-jp-1.3b-v1.1)|13.33|44.40|50.00|51.83|**48.98**|**50.39**|
|[llava-jp-1.3b-v1.1-llava-jp-instruct-108k](https://huggingface.co/toshi456/llava-jp-1.3b-v1.1-llava-jp-instruct-108k)|-|17.07|**50.60**|45.31|33.24|41.52|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/630af71ffaaea618ebc973db/SIXXIqwp-voffOXKZouqb.png)
## How to use the model
**1. Download dependencies**
```
git clone https://github.com/tosiyuki/LLaVA-JP.git
```
**2. Inference**
```python
import torch
import transformers
from PIL import Image
from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.llava_gpt2 import LlavaGpt2ForCausalLM
from llava.train.dataset import tokenizer_image_token
if __name__ == "__main__":
model_path = 'toshi456/llava-jp-1.3b-v1.1-llava-jp-instruct-108k'
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if device=="cuda" else torch.float32
model = LlavaGpt2ForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
use_safetensors=True,
torch_dtype=torch_dtype,
device_map=device,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_path,
model_max_length=1532,
padding_side="right",
use_fast=False,
)
model.eval()
conv_mode = "v1"
conv = conv_templates[conv_mode].copy()
# image pre-process
image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
image_size = model.get_model().vision_tower.image_processor.size["height"]
if model.get_model().vision_tower.scales is not None:
image_size = model.get_model().vision_tower.image_processor.size["height"] * len(model.get_model().vision_tower.scales)
if device == "cuda":
image_tensor = model.get_model().vision_tower.image_processor(
image,
return_tensors='pt',
size={"height": image_size, "width": image_size}
)['pixel_values'].half().cuda().to(torch_dtype)
else:
image_tensor = model.get_model().vision_tower.image_processor(
image,
return_tensors='pt',
size={"height": image_size, "width": image_size}
)['pixel_values'].to(torch_dtype)
# create prompt
# ユーザー: <image>\n{prompt}
prompt = "画像について説明してください。"
inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(
prompt,
tokenizer,
IMAGE_TOKEN_INDEX,
return_tensors='pt'
).unsqueeze(0)
if device == "cuda":
input_ids = input_ids.to(device)
input_ids = input_ids[:, :-1] # </sep>がinputの最後に入るので削除する
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)
# predict
with torch.inference_mode():
output_id = model.generate(
inputs=input_ids,
images=image_tensor,
do_sample=False,
temperature=1.0,
top_p=1.0,
no_repeat_ngram_size=2,
max_new_tokens=256,
streamer=streamer,
use_cache=True,
)
"""グレーの壁に置かれた木製のテーブルの上に、茶色のタビーの猫が横たわっている。猫は右を向いており、頭は左を向き、尻尾は体の前に突き出ているように見える。テーブルは木製で、猫の後ろには黒い金属製の脚があり、テーブルの下には小さな緑の植物が置かれる。<EOD|LLM-jp>"""
```
## Training dataset
**Stage1 Pretrain**
- [LLaVA-Pretrain-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA)
**Stage2 Fine-tuning**
- [LLaVA-JP-Instruct-108K](https://huggingface.co/datasets/toshi456/LLaVA-JP-Instruct-108K)
## Acknowledgement
- [LLaVA](https://llava-vl.github.io/)
- [LLM-jp](https://llm-jp.nii.ac.jp/)
- [scaling_on_scales](https://github.com/bfshi/scaling_on_scales/tree/master)
## License
Apache License 2.0 |