Update README
README.md
CHANGED
@@ -1,3 +1,366 @@
# LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
This is the official repository for the multi-modal large language model **LaVIT**. The inference code of LaVIT can be found [here](https://github.com/jy0205/LaVIT).

[[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]

## News and Updates
* ```2023.10.17``` 🚀🚀🚀 We released the pre-trained weights for **LaVIT** on HuggingFace and provide inference code for both multi-modal understanding and generation.

## Setup

### Requirements

The code for this repo is tested with PyTorch 1.13.1 and CUDA 11.7.
You should first install and configure the PyTorch environment (including torch and torchvision), and then install the requirements with the following commands:

```shell
git clone https://github.com/jy0205/LaVIT.git
cd LaVIT
pip install -r requirements.txt
```
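
Before installing, it can help to confirm that the local environment is close to the tested one. The sketch below is not part of the LaVIT repository: `TESTED`, `normalize`, and `matches_tested` are hypothetical names. It compares version strings, such as those reported by `torch.__version__` and `torch.version.cuda`, against the versions stated above.

```python
# Hypothetical helper, not part of the LaVIT codebase.
# The tested versions are taken from the Requirements text above.
TESTED = {"torch": "1.13.1", "cuda": "11.7"}


def normalize(version):
    """Drop a local build suffix such as '+cu117' and surrounding spaces."""
    return version.split("+", 1)[0].strip()


def matches_tested(torch_version, cuda_version):
    """Map each component to True if it matches the tested version."""
    found = {"torch": torch_version, "cuda": cuda_version}
    return {
        name: found[name] is not None and normalize(found[name]) == expected
        for name, expected in TESTED.items()
    }


if __name__ == "__main__":
    # In a real environment, pass torch.__version__ and torch.version.cuda
    # (the latter is None for CPU-only builds).
    print(matches_tested("1.13.1+cu117", "11.7"))
```

A mismatch does not necessarily mean the code will fail; it only signals that you are outside the configuration the authors tested.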

### Model Zoo
We release the LaVIT weights, built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
> Note: Due to the license restrictions of Llama1, we cannot publish its weights. Thus, we release the LaVIT weights based on Llama2.

LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown below:

#### Zero-shot Multi-modal Understanding

<table>
<thead align="center">
  <tr>
    <th rowspan="2">Model</th>
    <th colspan="3">Image Captioning</th>
    <th colspan="4">Visual Question Answering</th>
  </tr>
  <tr>
    <th>COCO</th>
    <th>NoCaps</th>
    <th>Flickr30K</th>
    <th>VQAv2</th>
    <th>OK-VQA</th>
    <th>GQA</th>
    <th>VizWiz</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td>Flamingo-3B</td>
    <td>73.0</td>
    <td>-</td>
    <td>60.6</td>
    <td>49.2</td>
    <td>41.2</td>
    <td>-</td>
    <td>28.9</td>
  </tr>
  <tr>
    <td>Flamingo-9B</td>
    <td>79.4</td>
    <td>-</td>
    <td>61.5</td>
    <td>51.8</td>
    <td>44.7</td>
    <td>-</td>
    <td>28.8</td>
  </tr>
  <tr>
    <td>OpenFlamingo-9B</td>
    <td>79.5</td>
    <td>-</td>
    <td>59.5</td>
    <td>52.7</td>
    <td>37.8</td>
    <td>-</td>
    <td>27.5</td>
  </tr>
  <tr>
    <td>MetaLM</td>
    <td>82.2</td>
    <td>-</td>
    <td>43.4</td>
    <td>41.1</td>
    <td>11.4</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>Kosmos-1</td>
    <td>84.7</td>
    <td>-</td>
    <td>67.1</td>
    <td>51.0</td>
    <td>-</td>
    <td>-</td>
    <td>29.2</td>
  </tr>
  <tr>
    <td>Kosmos-2</td>
    <td>-</td>
    <td>-</td>
    <td>80.5</td>
    <td>51.1</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>BLIP-2 (Vicuna-7B)</td>
    <td>-</td>
    <td>107.5</td>
    <td>74.9</td>
    <td>-</td>
    <td>-</td>
    <td>41.3</td>
    <td>25.3</td>
  </tr>
  <tr>
    <td>BLIP-2 (Vicuna-13B)</td>
    <td>-</td>
    <td>103.9</td>
    <td>71.6</td>
    <td>65.0</td>
    <td>45.9</td>
    <td>61.0</td>
    <td>19.6</td>
  </tr>
  <tr>
    <td>CM3Leon-7B</td>
    <td>61.6</td>
    <td>-</td>
    <td>-</td>
    <td>47.6</td>
    <td>-</td>
    <td>-</td>
    <td>37.6</td>
  </tr>
  <tr>
    <td>Emu (LLaMA-1-13B)</td>
    <td>112.4</td>
    <td>-</td>
    <td>-</td>
    <td>52.0</td>
    <td>38.2</td>
    <td>-</td>
    <td>34.2</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-1-7B)</td>
    <td>134.0</td>
    <td><b>114.2</b></td>
    <td>83.0</td>
    <td>66.0</td>
    <td>54.6</td>
    <td>46.8</td>
    <td>38.5</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-2-7B)</td>
    <td><b>134.6</b></td>
    <td>113.1</td>
    <td><b>83.2</b></td>
    <td><b>68.2</b></td>
    <td><b>55.7</b></td>
    <td><b>48.0</b></td>
    <td><b>45.3</b></td>
  </tr>
</tbody>
</table>

#### Zero-shot Text-to-Image Generation

<table>
<thead>
  <tr>
    <th>Method</th>
    <th>Model</th>
    <th>Model type</th>
    <th>FID</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td rowspan="9">Text2Image Specialist</td>
    <td>DALL-E</td>
    <td>Autoregressive</td>
    <td>28.0</td>
  </tr>
  <tr>
    <td>CogView</td>
    <td>Autoregressive</td>
    <td>27.1</td>
  </tr>
  <tr>
    <td>StableDiffusion</td>
    <td>Diffusion</td>
    <td>12.6</td>
  </tr>
  <tr>
    <td>GLIDE</td>
    <td>Diffusion</td>
    <td>12.2</td>
  </tr>
  <tr>
    <td>DALL-E 2</td>
    <td>Diffusion</td>
    <td>10.4</td>
  </tr>
  <tr>
    <td>Make-A-Scene</td>
    <td>Autoregressive</td>
    <td>11.8</td>
  </tr>
  <tr>
    <td>MUSE-7.6B</td>
    <td>Non-Autoregressive</td>
    <td>7.9</td>
  </tr>
  <tr>
    <td>Imagen-3.4B</td>
    <td>Diffusion</td>
    <td>7.3</td>
  </tr>
  <tr>
    <td>Parti-20B</td>
    <td>Autoregressive</td>
    <td><b>7.2</b></td>
  </tr>
  <tr>
    <td rowspan="5">Multimodal Large Language Model</td>
    <td>GILL (OPT-6.7B)</td>
    <td>LLM</td>
    <td>12.2</td>
  </tr>
  <tr>
    <td>Emu (LLaMA-1-13B)</td>
    <td>LLM</td>
    <td>11.7</td>
  </tr>
  <tr>
    <td>CM3Leon-7B</td>
    <td>LLM</td>
    <td>10.8</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-1-7B)</td>
    <td>LLM</td>
    <td>7.4</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-2-7B)</td>
    <td>LLM</td>
    <td><b>7.2</b></td>
  </tr>
</tbody>
</table>

## Usage
LaVIT can serve as a multi-modal generalist that performs both multi-modal comprehension and generation. Below, we provide some examples. Only a few lines of code are needed to use **LaVIT** for inference. We also provide detailed examples in the Jupyter notebooks `understanding.ipynb` and `generation.ipynb`. You can refer to them to learn how to interact with LaVIT.

### Multi-modal Understanding

```python
import os
import random
import torch
import torch.nn as nn
from models import build_model
from PIL import Image

random.seed(42)
torch.manual_seed(42)

# The local directory where you saved the LaVIT pre-trained weights
model_path = '/path/LaVIT_weight'

# Use BFloat16 during inference
model_dtype = 'bf16'    # Or set to fp16 to enable float16 inference

# Run inference on GPU 0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')

# Build LaVIT for understanding and load its pre-trained weights (released on HuggingFace)
model = build_model(model_path=model_path, model_dtype=model_dtype,
                    device_id=device_id, use_xformers=False, understanding=True)
model = model.to(device)

# Image Captioning
image_path = 'demo/caption_image.jpg'
caption = model.generate({"image": image_path})[0]
print(caption)
# an old photo of a horse and buggy in front of a building

# Visual Question Answering
image_path = 'demo/qa_image.jpg'
question = "What's that drink in the glass?"
answer = model.predict_answers({"image": image_path, "text_input": question}, max_len=10)[0]
print("The answer is:", answer)
# The answer is: orange juice
```

### Multi-modal Generation

For image generation, the Classifier-Free Guidance scale is important. A larger scale encourages the model to generate samples that are closely related to the input prompt, at the cost of image quality. We recommend setting `guidance_scale_for_llm=3.0` by default; you can increase this scale (e.g., to 4.0 or 5.0) to make the generated image follow the semantics of the given prompt more closely.
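
To make the role of the scale concrete, the sketch below shows the standard classifier-free guidance combination on a vector of logits. It is a generic illustration, not LaVIT's internal code: how `guidance_scale_for_llm` is applied inside the model is an assumption here, and `apply_cfg` is a hypothetical name.

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


if __name__ == "__main__":
    cond = [2.0, 0.5, -1.0]    # illustrative logits given the text prompt
    uncond = [1.0, 0.5, 0.0]   # illustrative logits without the prompt
    for scale in (1.0, 3.0, 5.0):
        print(scale, apply_cfg(cond, uncond, scale))
```

With `scale=1.0` the result equals the conditional logits. Larger values push the prediction further away from the unconditional one, which is why the prompt is followed more closely as the scale grows.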

```python
import os
import torch
import torch.nn as nn
from models import build_model
from PIL import Image

torch.manual_seed(42)

# The local directory where you saved the LaVIT pre-trained weights
model_path = '/path/LaVIT_weight'

# Use BFloat16 during inference
model_dtype = 'bf16'    # Or set to fp16 to enable float16 inference

# Run inference on GPU 0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')
torch_dtype = torch.bfloat16 if model_dtype == "bf16" else torch.float16

# Build LaVIT for generation and load its pre-trained weights (released on HuggingFace)
model = build_model(model_path=model_path, model_dtype=model_dtype,
                    device_id=device_id, use_xformers=False, understanding=False)
model = model.to(device)

# Make sure the output directory exists before saving images
os.makedirs("output", exist_ok=True)

# Text-to-Image Generation
prompt = "a sculpture of a duck made of wool"
with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    image = model.generate_image(prompt, guidance_scale_for_llm=3.0, num_return_images=1)[0]
image.save("output/t2i_output.jpg")

# Multi-modal Image Synthesis
image_prompt = 'demo/dog.jpg'
text_prompt = 'It is running in the snow'
input_prompts = [(image_prompt, 'image'), (text_prompt, 'text')]
with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    image = model.multimodal_synthesis(input_prompts, guidance_scale_for_llm=5.0, num_return_images=1)[0]
image.save("output/it2i_output.jpg")
```

## Acknowledgement
We are grateful to the following awesome projects, which we relied on when implementing LaVIT:
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers

## <a name="Citing"></a>Citation
Consider giving this repository a star and citing LaVIT in your publications if it helps your research.

```
@article{jin2023unified,
  title={Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization},
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
  journal={arXiv preprint arXiv:2309.04669},
  year={2023}
}
```