---
license: llama2
---

# LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

This is the official repository for the multi-modal large language model **LaVIT**. The inference code for LaVIT can be found [here](https://github.com/jy0205/LaVIT).

[[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]

## News and Updates
* ```2023.10.17``` 🚀🚀🚀 We release the pre-trained weights for **LaVIT** on HuggingFace and provide the inference code for both multi-modal understanding and generation.

## Setup

### Requirements

The code in this repo is tested with PyTorch 1.13.1 and CUDA 11.7.
You should first install and configure the PyTorch environment (including torch and torchvision), then install the remaining requirements with the following commands:

```shell
git clone https://github.com/jy0205/LaVIT.git
cd LaVIT
pip install -r requirements.txt
```
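
To confirm that your environment matches the tested configuration, a quick sanity check such as the following can help (a minimal sketch; nearby versions may also work, but PyTorch 1.13.1 with CUDA 11.7 is the tested setup):

```python
import torch
import torchvision

# The repo is tested with PyTorch 1.13.1 and CUDA 11.7.
print("torch:", torch.__version__)                  # expected: 1.13.1 (+cu117)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime version:", torch.version.cuda)  # expected: 11.7
```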

### Model Zoo
We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
> Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama-2.
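
To place the released weights in a local directory (the `model_path` used in the examples below), one option is `huggingface_hub`. This is a sketch only: the repo id shown is an assumption, so substitute the actual id of this model card.

```python
from huggingface_hub import snapshot_download

# NOTE: the repo id below is an assumption -- replace it with the actual
# HuggingFace repo id hosting the LaVIT weights (i.e., this model card).
snapshot_download(
    repo_id="rain1011/LaVIT-7B-v1",
    local_dir="/path/LaVIT_weight",  # becomes `model_path` in the usage examples below
)
```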

LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown below.

#### Zero-shot Multi-modal Understanding

<table>
<thead align="center">
  <tr>
    <th rowspan="2">Model</th>
    <th colspan="3">Image Captioning (CIDEr)</th>
    <th colspan="4">Visual Question Answering (Accuracy)</th>
  </tr>
  <tr>
    <th>COCO</th>
    <th>NoCaps</th>
    <th>Flickr30K</th>
    <th>VQAv2</th>
    <th>OK-VQA</th>
    <th>GQA</th>
    <th>VizWiz</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td>Flamingo-3B</td>
    <td>73.0</td>
    <td>-</td>
    <td>60.6</td>
    <td>49.2</td>
    <td>41.2</td>
    <td>-</td>
    <td>28.9</td>
  </tr>
  <tr>
    <td>Flamingo-9B</td>
    <td>79.4</td>
    <td>-</td>
    <td>61.5</td>
    <td>51.8</td>
    <td>44.7</td>
    <td>-</td>
    <td>28.8</td>
  </tr>
  <tr>
    <td>OpenFlamingo-9B</td>
    <td>79.5</td>
    <td>-</td>
    <td>59.5</td>
    <td>52.7</td>
    <td>37.8</td>
    <td>-</td>
    <td>27.5</td>
  </tr>
  <tr>
    <td>MetaLM</td>
    <td>82.2</td>
    <td>-</td>
    <td>43.4</td>
    <td>41.1</td>
    <td>11.4</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>Kosmos-1</td>
    <td>84.7</td>
    <td>-</td>
    <td>67.1</td>
    <td>51.0</td>
    <td>-</td>
    <td>-</td>
    <td>29.2</td>
  </tr>
  <tr>
    <td>Kosmos-2</td>
    <td>-</td>
    <td>-</td>
    <td>80.5</td>
    <td>51.1</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>BLIP-2 (Vicuna-7B)</td>
    <td>-</td>
    <td>107.5</td>
    <td>74.9</td>
    <td>-</td>
    <td>-</td>
    <td>41.3</td>
    <td>25.3</td>
  </tr>
  <tr>
    <td>BLIP-2 (Vicuna-13B)</td>
    <td>-</td>
    <td>103.9</td>
    <td>71.6</td>
    <td>65.0</td>
    <td>45.9</td>
    <td>61.0</td>
    <td>19.6</td>
  </tr>
  <tr>
    <td>CM3Leon-7B</td>
    <td>61.6</td>
    <td>-</td>
    <td>-</td>
    <td>47.6</td>
    <td>-</td>
    <td>-</td>
    <td>37.6</td>
  </tr>
  <tr>
    <td>Emu (LLaMA-1-13B)</td>
    <td>112.4</td>
    <td>-</td>
    <td>-</td>
    <td>52.0</td>
    <td>38.2</td>
    <td>-</td>
    <td>34.2</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-1-7B)</td>
    <td>134.0</td>
    <td><b>114.2</b></td>
    <td>83.0</td>
    <td>66.0</td>
    <td>54.6</td>
    <td>46.8</td>
    <td>38.5</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-2-7B)</td>
    <td><b>134.6</b></td>
    <td>113.1</td>
    <td><b>83.2</b></td>
    <td><b>68.2</b></td>
    <td><b>55.7</b></td>
    <td><b>48.0</b></td>
    <td><b>45.3</b></td>
  </tr>
</tbody>
</table>

#### Zero-shot Text-to-Image Generation

<table>
<thead>
  <tr>
    <th>Method</th>
    <th>Model</th>
    <th>Model type</th>
    <th>FID (↓)</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td rowspan="9">Text2Image Specialist</td>
    <td>DALL-E</td>
    <td>Autoregressive</td>
    <td>28.0</td>
  </tr>
  <tr>
    <td>CogView</td>
    <td>Autoregressive</td>
    <td>27.1</td>
  </tr>
  <tr>
    <td>StableDiffusion</td>
    <td>Diffusion</td>
    <td>12.6</td>
  </tr>
  <tr>
    <td>GLIDE</td>
    <td>Diffusion</td>
    <td>12.2</td>
  </tr>
  <tr>
    <td>DALL-E 2</td>
    <td>Diffusion</td>
    <td>10.4</td>
  </tr>
  <tr>
    <td>Make-A-Scene</td>
    <td>Autoregressive</td>
    <td>11.8</td>
  </tr>
  <tr>
    <td>MUSE-7.6B</td>
    <td>Non-Autoregressive</td>
    <td>7.9</td>
  </tr>
  <tr>
    <td>Imagen-3.4B</td>
    <td>Diffusion</td>
    <td>7.3</td>
  </tr>
  <tr>
    <td>Parti-20B</td>
    <td>Autoregressive</td>
    <td><b>7.2</b></td>
  </tr>
  <tr>
    <td rowspan="5">Multimodal Large Language Model</td>
    <td>GILL (OPT-6.7B)</td>
    <td>LLM</td>
    <td>12.2</td>
  </tr>
  <tr>
    <td>Emu (LLaMA-1-13B)</td>
    <td>LLM</td>
    <td>11.7</td>
  </tr>
  <tr>
    <td>CM3Leon-7B</td>
    <td>LLM</td>
    <td>10.8</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-1-7B)</td>
    <td>LLM</td>
    <td>7.4</td>
  </tr>
  <tr>
    <td>LaVIT (LLaMA-2-7B)</td>
    <td>LLM</td>
    <td><b>7.2</b></td>
  </tr>
</tbody>
</table>

## Usage
LaVIT can serve as a multi-modal generalist that performs both multi-modal comprehension and generation. Below we provide some examples; only a few lines of code are needed to run **LaVIT** for inference. We also provide detailed examples in the Jupyter notebooks `understanding.ipynb` and `generation.ipynb`, which show how to interact with LaVIT.

### Multi-modal Understanding

```python
import os
import random
import torch
import torch.nn as nn
from models import build_model
from PIL import Image

random.seed(42)
torch.manual_seed(42)

# The local directory where you saved the LaVIT pre-trained weights
model_path = '/path/LaVIT_weight'

# Use BFloat16 during inference
model_dtype = 'bf16'  # Or set to 'fp16' to enable float16 inference

# Run inference on GPU 0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')

# Build LaVIT for understanding and load its weights from HuggingFace
model = build_model(model_path=model_path, model_dtype=model_dtype,
                    device_id=device_id, use_xformers=False, understanding=True)
model = model.to(device)

# Image Captioning
image_path = 'demo/caption_image.jpg'
caption = model.generate({"image": image_path})[0]
print(caption)
# an old photo of a horse and buggy in front of a building

# Visual Question Answering
image_path = 'demo/qa_image.jpg'
question = "What's that drink in the glass?"
answer = model.predict_answers({"image": image_path, "text_input": question}, max_len=10)[0]
print("The answer is: ", answer)
# The answer is: orange juice
```
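
The same `model.generate` interface can be looped over a directory of images for batch captioning; here is a small sketch that assumes the understanding model built above and a hypothetical folder of `.jpg` files:

```python
import glob

# Caption every .jpg under demo/ with the understanding model built above.
for image_path in sorted(glob.glob('demo/*.jpg')):
    caption = model.generate({"image": image_path})[0]
    print(f"{image_path}: {caption}")
```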

### Multi-modal Generation

For image generation, the Classifier-Free Guidance scale is important: a larger scale encourages the model to generate samples that follow the input prompt more closely, at the cost of some image quality. We recommend setting `guidance_scale_for_llm=3.0` by default; you can increase it (e.g., to 4.0 or 5.0) to push the generated image to follow the semantics of the given prompts more strictly.

```python
import os
import torch
import torch.nn as nn
from models import build_model
from PIL import Image

torch.manual_seed(42)

# The local directory where you saved the LaVIT pre-trained weights
model_path = '/path/LaVIT_weight'

# Use BFloat16 during inference
model_dtype = 'bf16'  # Or set to 'fp16' to enable float16 inference

# Run inference on GPU 0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')
torch_dtype = torch.bfloat16 if model_dtype == "bf16" else torch.float16

# Build LaVIT for generation and load its weights from HuggingFace
model = build_model(model_path=model_path, model_dtype=model_dtype,
                    device_id=device_id, use_xformers=False, understanding=False)
model = model.to(device)

# Make sure the output directory exists before saving images
os.makedirs("output", exist_ok=True)

# Text-to-Image Generation
prompt = "a sculpture of a duck made of wool"
with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    image = model.generate_image(prompt, guidance_scale_for_llm=3.0, num_return_images=1)[0]
image.save("output/i2t_output.jpg")

# Multi-modal Image Synthesis
image_prompt = 'demo/dog.jpg'
text_prompt = 'It is running in the snow'
input_prompts = [(image_prompt, 'image'), (text_prompt, 'text')]
with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    image = model.multimodal_synthesis(input_prompts, guidance_scale_for_llm=5.0, num_return_images=1)[0]
image.save("output/it2i_output.jpg")
```
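
To see the effect of the guidance scale discussed above, you can sweep `guidance_scale_for_llm` over a few values and compare the results; a small sketch that reuses the generation model and `torch_dtype` defined above:

```python
# Compare several Classifier-Free Guidance scales for the same prompt.
prompt = "a sculpture of a duck made of wool"
for cfg_scale in (3.0, 4.0, 5.0):
    with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
        image = model.generate_image(prompt, guidance_scale_for_llm=cfg_scale, num_return_images=1)[0]
    image.save(f"output/t2i_cfg_{cfg_scale}.jpg")
```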

## Acknowledgement
We are grateful to the following awesome projects, which we drew on when implementing LaVIT:
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers


## <a name="Citing"></a>Citation
Consider giving this repository a star and citing LaVIT in your publications if it helps your research.

```
@article{jin2023unified,
  title={Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization},
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
  journal={arXiv preprint arXiv:2309.04669},
  year={2023}
}
```