rain1011 committed on
Commit
7549489
•
1 Parent(s): 13318dd

Update README.md

Files changed (1)
  1. README.md +45 -24
README.md CHANGED
@@ -3,12 +3,13 @@ license: llama2
3
  pipeline_tag: text-to-image
4
  ---
5
  # LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
6
- This is the official repository for the multi-modal large langauge model: **LaVIT**. The inference code of LaVIT can be found in [here](https://github.com/jy0205/LaVIT).
7
 
8
  [[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]
9
 
10
  ## News and Updates
11
  * ```2023.10.17``` 🚀🚀🚀 We release the pre-trained weights for **LaVIT** on HuggingFace and provide the inference code for using them for both multi-modal understanding and generation.
 
12
 
13
  ## Setup
14
 
@@ -23,11 +24,13 @@ cd LaVIT
23
  pip install -r requirements.txt
24
  ```
25
 
 
 
26
  ### Model Zoo
27
  We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
28
  > Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama-2.
29
 
30
- LaVIT achieves the state-of-the-arts performance on various multi-modal downstream tasks. The detailed quantitive results are shown as follows:
31
 
32
  #### Zero-shot Multi-modal Understanding
33
 
@@ -124,9 +127,9 @@ LaVIT achieves the state-of-the-arts performance on various multi-modal downstre
124
  <td>-</td>
125
  <td>103.9</td>
126
  <td>71.6</td>
127
- <td>65.0</td>
128
- <td>45.9</td>
129
- <td>61.0</td>
130
  <td>19.6</td>
131
  </tr>
132
  <tr>
@@ -260,7 +263,11 @@ LaVIT achieves the state-of-the-arts performance on various multi-modal downstre
260
  </table>
261
 
262
  ## Usage
263
- LaVIT can serve as a multi-modal generalist to perform both multi-modal comprehension and generation. Below, we provide some example. Only a few lines of codes are needed to use **LaVIT** for inference. We also provide the detailed examples in the jupyter notebooks: `understanding.ipynb` and `generation.ipynb`. You can refer them for learning how to interact with LaVIT.
 
 
 
 
264
 
265
  ### Multi-modal Understanding
266
 
@@ -275,7 +282,8 @@ from PIL import Image
275
  random.seed(42)
276
  torch.manual_seed(42)
277
 
278
- # The local directory you save the LaVIT pre-trained weight
 
279
  model_path = '/path/LaVIT_weight'
280
 
281
  # Using BFloat16 during inference
@@ -305,9 +313,9 @@ print("The answer is: ", answer)
305
  # The answer is: orange juice
306
  ```
307
 
308
- ### Multi-modal generation
309
 
310
- For the Image generation, the Classifier-Free Guidance scale is important. A larger scale will encourage the model to generate samples highly related to the input prompt while sacrificing the image quality. We recommend to set `guidance_scale_for_llm=3.0` by default, you can increase this scale (e.g., 4.0 or 5.0) for encouraging the generated image to follow the semantics of given prompts.
311
 
312
  ```python
313
  import os
@@ -316,9 +324,8 @@ import torch.nn as nn
316
  from models import build_model
317
  from PIL import Image
318
 
319
- torch.manual_seed(42)
320
-
321
- # The local directory you save the LaVIT pre-trained weight
322
  model_path = '/path/LaVIT_weight'
323
 
324
  # Using BFloat16 during inference
@@ -331,31 +338,45 @@ device = torch.device('cuda')
331
  torch_dtype = torch.bfloat16 if model_dtype=="bf16" else torch.float16
332
 
333
  # Building LaVIT for Generation and load the weight from huggingface
334
- model = build_model(model_path=model_path, model_dtype=model_dtype,
335
- device_id=device_id, use_xformers=False, understanding=False)
 
336
  model = model.to(device)
337
 
338
  # Text-to-Image Generation
339
  prompt = "a sculpture of a duck made of wool"
340
- with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
341
- image = model.generate_image(prompt, guidance_scale_for_llm=3.0, num_return_images=1)[0]
342
- image.save("output/i2t_output.jpg")
343
 
344
- # Multi-modal Image synthesis
345
- image_prompt = 'demo/dog.jpg'
346
- text_prompt = 'It is running in the snow'
347
- input_prompts = [(image_prompt, 'image'), (text_prompt, 'text')]
348
  with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
349
- image = model.multimodal_synthesis(input_prompts, guidance_scale_for_llm=5.0, num_return_images=1)[0]
350
- image.save("output/it2i_output.jpg")
 
 
351
  ```
352
 
 
 
 
353
  ## Acknowledgement
354
  We are grateful for the following awesome projects when implementing LaVIT:
355
  * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
356
  * [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
357
  * [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
358
  * [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
 
359
 
360
 
361
  ## <a name="Citing"></a>Citation
@@ -367,4 +388,4 @@ Consider giving this repository a star and cite LaVIT in your publications if it
367
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
368
  journal={arXiv preprint arXiv:2309.04669},
369
  year={2023}
370
- }
 
3
  pipeline_tag: text-to-image
4
  ---
5
  # LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
6
+ This is the official repository for the multi-modal large language model: **LaVIT**. The inference code of LaVIT can be found [here](https://github.com/jy0205/LaVIT).
7
 
8
  [[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]
9
 
10
  ## News and Updates
11
  * ```2023.10.17``` 🚀🚀🚀 We release the pre-trained weights for **LaVIT** on HuggingFace and provide the inference code for using them for both multi-modal understanding and generation.
12
+ * ```2023.10.31``` 🌟🌟🌟 We update the high-resolution pixel decoder in **LaVIT**, which supports generating high-resolution (1024 * 1024 pixels), multiple-aspect-ratio (1:1, 4:3, 3:2, 16:9, ...), and highly aesthetic images. The quality of the generated images has been improved significantly.
13
 
14
  ## Setup
15
 
 
24
  pip install -r requirements.txt
25
  ```
26
 
27
+ * (Optional) We recommend using memory-efficient attention by installing xFormers following the instructions [here](https://huggingface.co/docs/diffusers/main/en/optimization/xformers). Then, you can set the argument `use_xformers=True` in the `build_model` function to save GPU memory and speed up inference, as shown in the sketch below.
28
+
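A minimal sketch of this option is shown here; it assumes xFormers has been installed (e.g. `pip install xformers`, per the linked instructions) and reuses the `build_model` arguments from the generation example later in this README, with `device_id=0` as an illustrative choice:

```python
# Hedged sketch: build LaVIT with memory-efficient attention enabled.
# Assumes xFormers is installed and the weights are stored locally.
from models import build_model

model_path = '/path/LaVIT_weight'  # local directory for the LaVIT weights

# Same call as in the generation example below, but with use_xformers=True;
# device_id=0 is an illustrative choice, not an official default.
model = build_model(model_path=model_path, model_dtype='bf16',
                    device_id=0, use_xformers=True, understanding=False)
```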
29
  ### Model Zoo
30
  We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
31
  > Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama-2.
32
 
33
+ The pre-trained weights of LaVIT can be found on HuggingFace [here](https://huggingface.co/rain1011/LaVIT-7B-v1); they take around 22GB of disk space. LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown as follows:
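The inference code can also download the checkpoint automatically (see the comments in the examples below), but if you prefer to fetch the weights ahead of time, here is a minimal sketch using the `huggingface_hub` package; the package choice and the local path are our assumptions rather than part of the official instructions:

```python
# Hedged sketch: pre-download the LaVIT checkpoint from the HuggingFace Hub.
from huggingface_hub import snapshot_download

# local_dir is an illustrative path; point it to wherever you keep the weights.
snapshot_download(repo_id="rain1011/LaVIT-7B-v1", local_dir="/path/LaVIT_weight")
```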
34
 
35
  #### Zero-shot Multi-modal Understanding
36
 
 
127
  <td>-</td>
128
  <td>103.9</td>
129
  <td>71.6</td>
130
+ <td>-</td>
131
+ <td>-</td>
132
+ <td>32.3</td>
133
  <td>19.6</td>
134
  </tr>
135
  <tr>
 
263
  </table>
264
 
265
  ## Usage
266
+ LaVIT can serve as a multi-modal generalist to perform both multi-modal comprehension and generation. Below, we provide some examples. Only a few lines of code are needed to use **LaVIT** for inference. We also provide detailed examples in the following Jupyter notebooks, which show how to interact with LaVIT.
267
+
268
+ * `understanding.ipynb`: examples for multi-modal understanding
269
+ * `text2image_synthesis.ipynb`: examples for text-to-image generation
270
+ * `multimodal_synthesis.ipynb`: examples for image synthesis with multi-modal prompts (a brief sketch is also shown below)
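For quick reference, below is a brief multi-modal synthesis sketch adapted from the example in the previous version of this README; it assumes the generation model has already been built as in the Text-to-Image Synthesis section, and `multimodal_synthesis.ipynb` remains the authoritative reference:

```python
# Sketch adapted from the previous README example; the exact arguments may
# have changed, so prefer multimodal_synthesis.ipynb for up-to-date usage.
image_prompt = 'demo/dog.jpg'
text_prompt = 'It is running in the snow'
input_prompts = [(image_prompt, 'image'), (text_prompt, 'text')]

with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    image = model.multimodal_synthesis(input_prompts, guidance_scale_for_llm=5.0, num_return_images=1)[0]
image.save("output/it2i_output.jpg")
```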
271
 
272
  ### Multi-modal Understanding
273
 
 
282
  random.seed(42)
283
  torch.manual_seed(42)
284
 
285
+ # The local directory where you save the LaVIT pre-trained weights;
286
+ # the checkpoint will be downloaded from HuggingFace automatically
287
  model_path = '/path/LaVIT_weight'
288
 
289
  # Using BFloat16 during inference
 
313
  # The answer is: orange juice
314
  ```
315
 
316
+ ### Text-to-Image Synthesis
317
 
318
+ For image generation, the classifier-free guidance scale is important. A larger scale encourages the model to generate samples that are highly related to the input prompt, while sacrificing image quality. We set `guidance_scale_for_llm=4.0` by default; you can increase this scale (e.g., 5.0 or 6.0) to encourage the generated image to follow the semantics of the given prompt more closely. Besides, you can modify `ratio` to generate images with different aspect ratios.
319
 
320
  ```python
321
  import os
 
324
  from models import build_model
325
  from PIL import Image
326
 
327
+ # The local directory where you save the LaVIT pre-trained weights;
328
+ # the checkpoint will be downloaded from HuggingFace automatically
 
329
  model_path = '/path/LaVIT_weight'
330
 
331
  # Using BFloat16 during inference
 
338
  torch_dtype = torch.bfloat16 if model_dtype=="bf16" else torch.float16
339
 
340
  # Building LaVIT for Generation and load the weight from huggingface
341
+ # You can set `use_xformers=True` if you have installed xFormers to save GPU memory and speed up inference
342
+ model = build_model(model_path=model_path, model_dtype=model_dtype, device_id=device_id,
343
+ use_xformers=False, understanding=False, load_tokenizer=False)
344
  model = model.to(device)
345
 
346
  # Text-to-Image Generation
347
  prompt = "a sculpture of a duck made of wool"
 
 
 
348
 
349
+ # LaVIT supports 6 different image aspect ratios; each entry below is (height, width)
350
+ ratio_dict = {
351
+ '1:1' : (1024, 1024),
352
+ '4:3' : (896, 1152),
353
+ '3:2' : (832, 1216),
354
+ '16:9' : (768, 1344),
355
+ '2:3' : (1216, 832),
356
+ '3:4' : (1152, 896),
357
+ }
358
+
359
+ # The image aspect ratio you want to generate
360
+ ratio = '1:1'
361
+ height, width = ratio_dict[ratio]
362
+
363
  with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
364
+ images = model.generate_image(prompt, width=width, height=height,
365
+ num_return_images=1, guidance_scale_for_llm=4.0, num_inference_steps=50)
366
+
367
+ images[0].save("output/i2t_output.jpg")
368
  ```
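To see how the guidance scale affects results, you can run a small sweep like the sketch below; it simply reuses `model`, `prompt`, `height`, `width`, and `torch_dtype` from the example above, and the loop itself is our illustration rather than part of the official examples:

```python
# Hedged usage sketch: compare several classifier-free guidance scales.
for scale in (4.0, 5.0, 6.0):
    with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
        images = model.generate_image(prompt, width=width, height=height,
            num_return_images=1, guidance_scale_for_llm=scale, num_inference_steps=50)
    images[0].save(f"output/t2i_scale_{scale}.jpg")
```

Higher scales typically follow the prompt more literally at some cost in image quality, consistent with the note above.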
369
 
370
+ ## Evaluation
371
+ The batch evaluation code with multiple GPUs for the adopted multi-modal benchmarks will be released in the following days.
372
+
373
  ## Acknowledgement
374
  We are grateful for the following awesome projects when implementing LaVIT:
375
  * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
376
  * [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
377
  * [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
378
  * [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
379
+ * [Diffusers](https://github.com/huggingface/diffusers): State-of-the-art diffusion models for image and audio generation in PyTorch.
380
 
381
 
382
  ## <a name="Citing"></a>Citation
 
388
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
389
  journal={arXiv preprint arXiv:2309.04669},
390
  year={2023}
391
+ }