rain1011 committed on
Commit
7549489
•
1 Parent(s): 13318dd

Update README.md

Files changed (1)
  1. README.md +45 -24
README.md CHANGED
@@ -3,12 +3,13 @@ license: llama2
3
  pipeline_tag: text-to-image
4
  ---
5
  # LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
6
- This is the official repository for the multi-modal large langauge model: **LaVIT**. The inference code of LaVIT can be found in [here](https://github.com/jy0205/LaVIT).
7
 
8
  [[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]
9
 
10
  ## News and Updates
11
  * ```2023.10.17``` 🚀🚀🚀 We release the pre-trained weights for **LaVIT** on HuggingFace and provide the inference code for using them for both multi-modal understanding and generation.
 
12
 
13
  ## Setup
14
 
@@ -23,11 +24,13 @@ cd LaVIT
23
  pip install -r requirements.txt
24
  ```
25
 
 
 
26
  ### Model Zoo
27
  We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
28
  > Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama-2.
29
 
30
- LaVIT achieves the state-of-the-arts performance on various multi-modal downstream tasks. The detailed quantitive results are shown as follows:
31
 
32
  #### Zero-shot Multi-modal Understanding
33
 
@@ -124,9 +127,9 @@ LaVIT achieves the state-of-the-arts performance on various multi-modal downstre
124
  <td>-</td>
125
  <td>103.9</td>
126
  <td>71.6</td>
127
- <td>65.0</td>
128
- <td>45.9</td>
129
- <td>61.0</td>
130
  <td>19.6</td>
131
  </tr>
132
  <tr>
@@ -260,7 +263,11 @@ LaVIT achieves the state-of-the-arts performance on various multi-modal downstre
260
  </table>
261
 
262
  ## Usage
263
- LaVIT can serve as a multi-modal generalist to perform both multi-modal comprehension and generation. Below, we provide some example. Only a few lines of codes are needed to use **LaVIT** for inference. We also provide the detailed examples in the jupyter notebooks: `understanding.ipynb` and `generation.ipynb`. You can refer them for learning how to interact with LaVIT.
 
 
 
 
264
 
265
  ### Multi-modal Understanding
266
 
@@ -275,7 +282,8 @@ from PIL import Image
275
  random.seed(42)
276
  torch.manual_seed(42)
277
 
278
- # The local directory you save the LaVIT pre-trained weight
 
279
  model_path = '/path/LaVIT_weight'
280
 
281
  # Using BFloat16 during inference
@@ -305,9 +313,9 @@ print("The answer is: ", answer)
305
  # The answer is: orange juice
306
  ```
307
 
308
- ### Multi-modal generation
309
 
310
- For the Image generation, the Classifier-Free Guidance scale is important. A larger scale will encourage the model to generate samples highly related to the input prompt while sacrificing the image quality. We recommend to set `guidance_scale_for_llm=3.0` by default, you can increase this scale (e.g., 4.0 or 5.0) for encouraging the generated image to follow the semantics of given prompts.
311
 
312
  ```python
313
  import os
@@ -316,9 +324,8 @@ import torch.nn as nn
316
  from models import build_model
317
  from PIL import Image
318
 
319
- torch.manual_seed(42)
320
-
321
- # The local directory you save the LaVIT pre-trained weight
322
  model_path = '/path/LaVIT_weight'
323
 
324
  # Using BFloat16 during inference
@@ -331,31 +338,45 @@ device = torch.device('cuda')
331
  torch_dtype = torch.bfloat16 if model_dtype=="bf16" else torch.float16
332
 
333
  # Building LaVIT for Generation and load the weight from huggingface
334
- model = build_model(model_path=model_path, model_dtype=model_dtype,
335
- device_id=device_id, use_xformers=False, understanding=False)
 
336
  model = model.to(device)
337
 
338
  # Text-to-Image Generation
339
  prompt = "a sculpture of a duck made of wool"
340
- with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
341
- image = model.generate_image(prompt, guidance_scale_for_llm=3.0, num_return_images=1)[0]
342
- image.save("output/i2t_output.jpg")
343
 
344
- # Multi-modal Image synthesis
345
- image_prompt = 'demo/dog.jpg'
346
- text_prompt = 'It is running in the snow'
347
- input_prompts = [(image_prompt, 'image'), (text_prompt, 'text')]
348
  with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
349
- image = model.multimodal_synthesis(input_prompts, guidance_scale_for_llm=5.0, num_return_images=1)[0]
350
- image.save("output/it2i_output.jpg")
 
 
351
  ```
352
 
 
 
 
353
  ## Acknowledgement
354
  We are grateful for the following awesome projects when implementing LaVIT:
355
  * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
356
  * [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
357
  * [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
358
  * [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
 
359
 
360
 
361
  ## <a name="Citing"></a>Citation
@@ -367,4 +388,4 @@ Consider giving this repository a star and cite LaVIT in your publications if it
367
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
368
  journal={arXiv preprint arXiv:2309.04669},
369
  year={2023}
370
- }
 
3
  pipeline_tag: text-to-image
4
  ---
5
  # LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
6
+ This is the official repository for the multi-modal large language model: **LaVIT**. The inference code of LaVIT can be found [here](https://github.com/jy0205/LaVIT).
7
 
8
  [[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]
9
 
10
  ## News and Updates
11
  * ```2023.10.17``` 🚀🚀🚀 We release the pre-trained weights for **LaVIT** on HuggingFace and provide the inference code for using them for both multi-modal understanding and generation.
12
+ * ```2023.10.31``` 🌟🌟🌟 We update the high-resolution pixel decoder in **LaVIT**, which supports generating high-resolution (1024 * 1024 pixels), multiple-aspect-ratio (1:1, 4:3, 3:2, 16:9, ...), and highly aesthetic images. The quality of the generated images has been improved significantly.
13
 
14
  ## Setup
15
 
 
24
  pip install -r requirements.txt
25
  ```
26
 
27
+ * (Optional) We recommend using memory-efficient attention by installing xFormers following the instructions [here](https://huggingface.co/docs/diffusers/main/en/optimization/xformers). Then, you can set the argument `use_xformers=True` in the `build_model` function to save GPU memory and speed up inference, as shown in the sketch below.
28
+
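A minimal sketch of this option is shown here; it assumes xFormers has been installed (e.g. `pip install xformers`, per the linked instructions) and reuses the `build_model` arguments from the generation example later in this README, with `device_id=0` as an illustrative choice:

```python
# Hedged sketch: build LaVIT with memory-efficient attention enabled.
# Assumes xFormers is installed and the weights are stored locally.
from models import build_model

model_path = '/path/LaVIT_weight'  # local directory for the LaVIT weights

# Same call as in the generation example below, but with use_xformers=True;
# device_id=0 is an illustrative choice, not an official default.
model = build_model(model_path=model_path, model_dtype='bf16',
                    device_id=0, use_xformers=True, understanding=False)
```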
29
  ### Model Zoo
30
  We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
31
  > Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama-2.
32
 
33
+ The pre-trained weights of LaVIT can be found on HuggingFace [here](https://huggingface.co/rain1011/LaVIT-7B-v1); they take around 22GB of disk space. LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown as follows:
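The inference code can also download the checkpoint automatically (see the comments in the examples below), but if you prefer to fetch the weights ahead of time, here is a minimal sketch using the `huggingface_hub` package; the package choice and the local path are our assumptions rather than part of the official instructions:

```python
# Hedged sketch: pre-download the LaVIT checkpoint from the HuggingFace Hub.
from huggingface_hub import snapshot_download

# local_dir is an illustrative path; point it to wherever you keep the weights.
snapshot_download(repo_id="rain1011/LaVIT-7B-v1", local_dir="/path/LaVIT_weight")
```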
34
 
35
  #### Zero-shot Multi-modal Understanding
36
 
 
127
  <td>-</td>
128
  <td>103.9</td>
129
  <td>71.6</td>
130
+ <td>-</td>
131
+ <td>-</td>
132
+ <td>32.3</td>
133
  <td>19.6</td>
134
  </tr>
135
  <tr>
 
263
  </table>
264
 
265
  ## Usage
266
+ LaVIT can serve as a multi-modal generalist to perform both multi-modal comprehension and generation. Below, we provide some examples. Only a few lines of code are needed to use **LaVIT** for inference. We also provide detailed examples in the following Jupyter notebooks, which show how to interact with LaVIT.
267
+
268
+ * `understanding.ipynb`: examples for multi-modal understanding
269
+ * `text2image_synthesis.ipynb`: examples for text-to-image generation
270
+ * `multimodal_synthesis.ipynb`: examples for image synthesis with multi-modal prompts (a brief sketch is also shown below)
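For quick reference, below is a brief multi-modal synthesis sketch adapted from the example in the previous version of this README; it assumes the generation model has already been built as in the Text-to-Image Synthesis section, and `multimodal_synthesis.ipynb` remains the authoritative reference:

```python
# Sketch adapted from the previous README example; the exact arguments may
# have changed, so prefer multimodal_synthesis.ipynb for up-to-date usage.
image_prompt = 'demo/dog.jpg'
text_prompt = 'It is running in the snow'
input_prompts = [(image_prompt, 'image'), (text_prompt, 'text')]

with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    image = model.multimodal_synthesis(input_prompts, guidance_scale_for_llm=5.0, num_return_images=1)[0]
image.save("output/it2i_output.jpg")
```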
271
 
272
  ### Multi-modal Understanding
273
 
 
282
  random.seed(42)
283
  torch.manual_seed(42)
284
 
285
+ # The local directory where you save the LaVIT pre-trained weights;
286
+ # the checkpoint will be downloaded from HuggingFace automatically
287
  model_path = '/path/LaVIT_weight'
288
 
289
  # Using BFloat16 during inference
 
313
  # The answer is: orange juice
314
  ```
315
 
316
+ ### Text-to-Image Synthesis
317
 
318
+ For image generation, the classifier-free guidance scale is important. A larger scale encourages the model to generate samples that are highly related to the input prompt, while sacrificing image quality. We set `guidance_scale_for_llm=4.0` by default; you can increase this scale (e.g., 5.0 or 6.0) to encourage the generated image to follow the semantics of the given prompt more closely. Besides, you can modify `ratio` to generate images with different aspect ratios.
319
 
320
  ```python
321
  import os
 
324
  from models import build_model
325
  from PIL import Image
326
 
327
+ # The local directory where you save the LaVIT pre-trained weights;
328
+ # the checkpoint will be downloaded from HuggingFace automatically
 
329
  model_path = '/path/LaVIT_weight'
330
 
331
  # Using BFloat16 during inference
 
338
  torch_dtype = torch.bfloat16 if model_dtype=="bf16" else torch.float16
339
 
340
  # Building LaVIT for Generation and load the weight from huggingface
341
+ # You can set `use_xformers=True` if you have installed xFormers to save GPU memory and speed up inference
342
+ model = build_model(model_path=model_path, model_dtype=model_dtype, device_id=device_id,
343
+ use_xformers=False, understanding=False, load_tokenizer=False)
344
  model = model.to(device)
345
 
346
  # Text-to-Image Generation
347
  prompt = "a sculpture of a duck made of wool"
 
 
 
348
 
349
+ # LaVIT supports 6 different image aspect ratios; each entry below is (height, width)
350
+ ratio_dict = {
351
+ '1:1' : (1024, 1024),
352
+ '4:3' : (896, 1152),
353
+ '3:2' : (832, 1216),
354
+ '16:9' : (768, 1344),
355
+ '2:3' : (1216, 832),
356
+ '3:4' : (1152, 896),
357
+ }
358
+
359
+ # The image aspect ratio you want to generate
360
+ ratio = '1:1'
361
+ height, width = ratio_dict[ratio]
362
+
363
  with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
364
+ images = model.generate_image(prompt, width=width, height=height,
365
+ num_return_images=1, guidance_scale_for_llm=4.0, num_inference_steps=50)
366
+
367
+ images[0].save("output/i2t_output.jpg")
368
  ```
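To see how the guidance scale affects results, you can run a small sweep like the sketch below; it simply reuses `model`, `prompt`, `height`, `width`, and `torch_dtype` from the example above, and the loop itself is our illustration rather than part of the official examples:

```python
# Hedged usage sketch: compare several classifier-free guidance scales.
for scale in (4.0, 5.0, 6.0):
    with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
        images = model.generate_image(prompt, width=width, height=height,
            num_return_images=1, guidance_scale_for_llm=scale, num_inference_steps=50)
    images[0].save(f"output/t2i_scale_{scale}.jpg")
```

Higher scales typically follow the prompt more literally at some cost in image quality, consistent with the note above.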
369
 
370
+ ## Evaluation
371
+ The batch evaluation code with multiple GPUs for the adopted multi-modal benchmarks will be released in the following days.
372
+
373
  ## Acknowledgement
374
  We are grateful for the following awesome projects when implementing LaVIT:
375
  * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
376
  * [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
377
  * [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
378
  * [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
379
+ * [Diffusers](https://github.com/huggingface/diffusers): State-of-the-art diffusion models for image and audio generation in PyTorch.
380
 
381
 
382
  ## <a name="Citing"></a>Citation
 
388
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
389
  journal={arXiv preprint arXiv:2309.04669},
390
  year={2023}
391
+ }