Update README.md
README.md CHANGED
@@ -3,12 +3,13 @@ license: llama2
pipeline_tag: text-to-image
---
# LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

This is the official repository for the multi-modal large language model **LaVIT**. The inference code for LaVIT can be found [here](https://github.com/jy0205/LaVIT).

[[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]

## News and Updates
* ```2023.10.17``` We release the pre-trained weights for **LaVIT** on Hugging Face and provide inference code for both multi-modal understanding and generation.
* ```2023.10.31``` We update the high-resolution pixel decoder in **LaVIT**, which supports generating high-resolution (1024 * 1024 pixels), multiple-aspect-ratio (1:1, 4:3, 3:2, 16:9, ...), and highly aesthetic images. The quality of the generated images is significantly improved.

## Setup

@@ -23,11 +24,13 @@ cd LaVIT
pip install -r requirements.txt
```

* (Optional) We recommend using memory-efficient attention by installing xFormers, following the instructions [here](https://huggingface.co/docs/diffusers/main/en/optimization/xformers). You can then set the argument `use_xformers=True` in the `build_model` function to save GPU memory and speed up inference (see the short sketch below).

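As a minimal sketch (the path and device id are placeholders; `build_model` and its arguments are the same ones shown in the generation example later in this README), enabling the flag looks like this:

```python
from models import build_model

model_path = '/path/LaVIT_weight'   # placeholder: your local checkpoint directory

# Same call as in the generation example below, but with xFormers
# memory-efficient attention enabled (requires xFormers to be installed).
model = build_model(model_path=model_path, model_dtype='bf16', device_id=0,
                    use_xformers=True, understanding=False, load_tokenizer=False)
```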
### Model Zoo
We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
> Note: Due to the license restrictions of LLaMA-1, we cannot publish weights based on it. Thus, we release the weights of LaVIT based on Llama 2.

The pre-trained weights of LaVIT are available on Hugging Face [here](https://huggingface.co/rain1011/LaVIT-7B-v1) and take around 22 GB of disk space. LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown below:

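If you prefer to fetch the checkpoint ahead of time rather than rely on the automatic download used in the examples below, one hedged option is `huggingface_hub` (assuming it is installed; the target directory is a placeholder):

```python
from huggingface_hub import snapshot_download

# Pre-download the LaVIT checkpoint (~22 GB) into a local directory,
# then point `model_path` in the examples below at that directory.
snapshot_download(repo_id="rain1011/LaVIT-7B-v1", local_dir="/path/LaVIT_weight")
```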
#### Zero-shot Multi-modal Understanding

@@ -124,9 +127,9 @@ LaVIT achieves the state-of-the-arts performance on various multi-modal downstre
<td>-</td>
<td>103.9</td>
<td>71.6</td>
<td>-</td>
<td>-</td>
<td>32.3</td>
<td>19.6</td>
</tr>
<tr>

@@ -260,7 +263,11 @@ LaVIT achieves the state-of-the-arts performance on various multi-modal downstre
</table>

## Usage
LaVIT can serve as a multi-modal generalist that performs both multi-modal comprehension and generation. Below, we provide some examples; only a few lines of code are needed to use **LaVIT** for inference. We also provide detailed examples in the following Jupyter notebooks that show how to interact with LaVIT:

* `understanding.ipynb`: examples of multi-modal understanding
* `text2image_synthesis.ipynb`: examples of text-to-image generation
* `multimodal_synthesis.ipynb`: examples of image synthesis with multi-modal prompts

### Multi-modal Understanding

@@ -275,7 +282,8 @@ from PIL import Image
random.seed(42)
torch.manual_seed(42)

# The local directory where you save the LaVIT pre-trained weights;
# the checkpoint will be downloaded from Hugging Face automatically
model_path = '/path/LaVIT_weight'

# Using BFloat16 during inference

@@ -305,9 +313,9 @@ print("The answer is: ", answer)
# The answer is: orange juice
```

### Text-to-Image Synthesis

For image generation, the classifier-free guidance scale is important: a larger scale encourages the model to generate samples that are highly related to the input prompt, at the cost of some image quality. We set `guidance_scale_for_llm=4.0` by default; you can increase this scale (e.g., to 5.0 or 6.0) to make the generated image follow the semantics of the given prompt more closely. Besides, you can modify `ratio` to generate images with different aspect ratios.

```python
import os

@@ -316,9 +324,8 @@ import torch.nn as nn
from models import build_model
from PIL import Image

# The local directory where you save the LaVIT pre-trained weights;
# the checkpoint will be downloaded from Hugging Face automatically
model_path = '/path/LaVIT_weight'

# Using BFloat16 during inference

@@ -331,31 +338,45 @@ device = torch.device('cuda')
torch_dtype = torch.bfloat16 if model_dtype=="bf16" else torch.float16

# Build LaVIT for generation and load the weights from Hugging Face.
# You can set `use_xformers=True` if you have installed xFormers, to save GPU memory and speed up inference.
model = build_model(model_path=model_path, model_dtype=model_dtype, device_id=device_id,
                    use_xformers=False, understanding=False, load_tokenizer=False)
model = model.to(device)

# Text-to-Image Generation
prompt = "a sculpture of a duck made of wool"

# LaVIT supports 6 different image aspect ratios
ratio_dict = {
    '1:1' : (1024, 1024),
    '4:3' : (896, 1152),
    '3:2' : (832, 1216),
    '16:9' : (768, 1344),
    '2:3' : (1216, 832),
    '3:4' : (1152, 896),
}

# The image aspect ratio you want to generate
ratio = '1:1'
height, width = ratio_dict[ratio]

with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    images = model.generate_image(prompt, width=width, height=height,
        num_return_images=1, guidance_scale_for_llm=4.0, num_inference_steps=50)

images[0].save("output/i2t_output.jpg")
```

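As a hedged variation on the snippet above (only the candidate count, the guidance scale, the loop, and `os.makedirs` differ; all `generate_image` arguments are the documented ones), you can sample several candidates per prompt with a slightly larger guidance scale:

```python
# Sketch: draw several candidates for one prompt and keep them all.
with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    images = model.generate_image(prompt, width=width, height=height,
        num_return_images=4, guidance_scale_for_llm=5.0, num_inference_steps=50)

os.makedirs("output", exist_ok=True)
for idx, image in enumerate(images):
    image.save(f"output/t2i_candidate_{idx}.jpg")
```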
## Evaluation
The batch evaluation code with multiple GPUs on the adopted multi-modal benchmarks will be released in the coming days.

## Acknowledgement
We are grateful for the following awesome projects when implementing LaVIT:
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
* [Diffusers](https://github.com/huggingface/diffusers): State-of-the-art diffusion models for image and audio generation in PyTorch

## <a name="Citing"></a>Citation

@@ -367,4 +388,4 @@ Consider giving this repository a star and cite LaVIT in your publications if it
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
  journal={arXiv preprint arXiv:2309.04669},
  year={2023}
}