Commit 310593d by leonardPKU (1 parent: 4dd2f13): Create README.md

README.md (new file, +83 lines)

# YING-VLM

We have open-sourced the trained checkpoint and inference code of [YING-VLM](https://huggingface.co/MMInstruction/YingVLM) on Hugging Face; the model is trained on the [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) dataset.
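
If you want to inspect the training data, it can be loaded with the `datasets` library. The snippet below is a minimal sketch; the task config name "coco" is only an illustrative placeholder, so check the M3IT dataset card for the actual list of available configs.

```python
from datasets import load_dataset

# Load one task config of M3IT; "coco" is a placeholder name,
# see the dataset card for the configs that actually exist.
ds = load_dataset("MMInstruction/M3IT", "coco", split="train")
print(len(ds), ds[0].keys())
```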

# Example of Using YING-VLM

Please install the following packages:

- torch==2.0.0
- transformers==4.31.0

Inference example:

```python
from transformers import AutoProcessor, AutoTokenizer
from PIL import Image
import torch

from modelingYING import VLMForConditionalGeneration


# set device
device = "cuda:0"

# set prompt template
prompt_template = """
<human>:
{instruction}
{input}
<bot>:
"""

# load processor and tokenizer
processor = AutoProcessor.from_pretrained("MMInstruction/YingVLM")
tokenizer = AutoTokenizer.from_pretrained("MMInstruction/YingVLM")  # ziya is not available right now

# load model
model = VLMForConditionalGeneration.from_pretrained("MMInstruction/YingVLM")
model.to(device, dtype=torch.float16)

# prepare input
image = Image.open("./imgs/night_house.jpeg")
instruction = "Scrutinize the given image and answer the connected question."
input = "What is the color of the couch?"
prompt = prompt_template.format(instruction=instruction, input=input)

# inference
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
text_inputs = tokenizer(prompt, return_tensors="pt")
inputs.update(text_inputs)

generated_ids = model.generate(**{k: v.to(device) for k, v in inputs.items()}, img_num=1, max_new_tokens=128, do_sample=False)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].split("\n")[0]  # \n is the end token

print(generated_text)
# The couch in the living room is green.
```
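
For repeated queries, the steps above can be wrapped into a small helper. This is a minimal sketch, not part of the released inference code: the function name `answer_question` is ours, and it assumes `processor`, `tokenizer`, `model`, `prompt_template`, and `device` have already been created exactly as in the example above.

```python
def answer_question(image_path, instruction, question):
    # build the prompt with the same <human>/<bot> template as above
    prompt = prompt_template.format(instruction=instruction, input=question)

    # preprocess the image and tokenize the prompt
    inputs = processor(images=Image.open(image_path), return_tensors="pt").to(device, torch.float16)
    inputs.update(tokenizer(prompt, return_tensors="pt"))

    # greedy decoding with the same settings as the example above
    generated_ids = model.generate(
        **{k: v.to(device) for k, v in inputs.items()},
        img_num=1,
        max_new_tokens=128,
        do_sample=False,
    )
    # keep only the first line of the decoded output; \n acts as the end token
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].split("\n")[0]


print(answer_question(
    "./imgs/night_house.jpeg",
    "Scrutinize the given image and answer the connected question.",
    "What is the color of the couch?",
))
```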

# Reference

If you find our work useful, please kindly cite:
```bib
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```