zR committed
Commit cb08fa4
1 Parent(s): d055a49

change qingying page

Files changed (2):
  1. README.md +59 -4
  2. README_zh.md +61 -2
README.md CHANGED
@@ -86,13 +86,13 @@ inference: false
 
 ## Model Introduction
 
- CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.
+ CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?lang=en&fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.
 
 <table style="border-collapse: collapse; width: 100%;">
 <tr>
 <th style="text-align: center;">Model Name</th>
 <th style="text-align: center;">CogVideoX-2B (This Repository)</th>
- <th style="text-align: center;">CogVideoX-5B </th>
+ <th style="text-align: center;">CogVideoX-5B</th>
 </tr>
 <tr>
 <td style="text-align: center;">Model Description</td>
@@ -106,8 +106,8 @@ CogVideoX is an open-source version of the video generation model originating fr
 </tr>
 <tr>
 <td style="text-align: center;">Single GPU VRAM Consumption</td>
- <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
- <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
+ <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
+ <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
 </tr>
 <tr>
 <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -218,6 +218,61 @@ video = pipe(
 export_to_video(video, "output.mp4", fps=8)
 ```
 
+ ## Quantized Inference
+ 
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the Text Encoder, Transformer, and VAE modules to lower CogVideoX's memory requirements. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with less VRAM as well! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which allows for much faster inference.
+ 
+ ```diff
+ # To get started, PytorchAO needs to be installed from GitHub source, along with PyTorch Nightly.
+ # Source and nightly installation is only required until the next release.
+ 
+ import torch
+ from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
+ from diffusers.utils import export_to_video
+ + from transformers import T5EncoderModel
+ + from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight
+ 
+ + quantization = int8_weight_only
+ 
+ + text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
+ + quantize_(text_encoder, quantization())
+ 
+ + transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
+ + quantize_(transformer, quantization())
+ 
+ + vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
+ + quantize_(vae, quantization())
+ 
+ # Create pipeline and run inference
+ pipe = CogVideoXPipeline.from_pretrained(
+     "THUDM/CogVideoX-2b",
+ + text_encoder=text_encoder,
+ + transformer=transformer,
+ + vae=vae,
+     torch_dtype=torch.bfloat16,
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.vae.enable_tiling()
+ 
+ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
+ 
+ video = pipe(
+     prompt=prompt,
+     num_videos_per_prompt=1,
+     num_inference_steps=50,
+     num_frames=49,
+     guidance_scale=6,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ ).frames[0]
+ 
+ export_to_video(video, "output.mp4", fps=8)
+ ```
+ 
+ Additionally, when using PytorchAO the models can be serialized and stored in a quantized data type to save disk space. Find examples and benchmarks at these links:
+ - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+ - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+ 
 ## Explore the Model
 
 Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
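
The new section above claims that TorchAO quantization composes with `torch.compile`. As a minimal, hedged sketch of that path (not part of this commit; it assumes the same checkpoint as the example, keeps the pipeline resident on the GPU instead of CPU-offloading, and uses only `quantize_` and `int8_weight_only` from torchao):

```python
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

# Keep the pipeline on the GPU; CPU offloading moves modules between
# devices and would interfere with the compiled graph.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16
).to("cuda")

# Quantize the transformer weights to int8, then compile it. The first
# call pays the compilation cost; later calls reuse the compiled graph.
quantize_(pipe.transformer, int8_weight_only())
pipe.transformer = torch.compile(pipe.transformer)

video = pipe(prompt="A panda playing a tiny guitar in a bamboo forest.", num_frames=49).frames[0]
```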
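The same paragraph names Optimum-quanto as an alternative, but only the TorchAO route is demonstrated. A minimal sketch of the quanto route, using `quantize`, `freeze`, and `qint8` from `optimum.quanto` (illustrative, not from the commit):

```python
import torch
from diffusers import CogVideoXPipeline
from optimum.quanto import freeze, qint8, quantize

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)

# Quantize the weights of each heavy module to int8, then freeze to
# materialize the quantized weights in place of the originals.
for module in (pipe.text_encoder, pipe.transformer, pipe.vae):
    quantize(module, weights=qint8)
    freeze(module)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
```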
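On the serialization point, the linked gists carry the full examples; the sketch below shows only one plausible core pattern for persisting torchao-quantized weights via a state dict with `load_state_dict(..., assign=True)`. Treat the flow and the file name as assumptions, not as the commit's documented method:

```python
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

# Quantize once and persist the int8 state dict (hypothetical file name).
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())
torch.save(transformer.state_dict(), "transformer_int8.pt")

# Later: rebuild the module and load the quantized weights; assign=True
# lets the quantized tensor subclasses replace the bf16 parameters.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16
)
transformer.load_state_dict(torch.load("transformer_int8.pt", map_location="cpu"), assign=True)
```
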
README_zh.md CHANGED
@@ -150,7 +150,8 @@ CogVideoX is an open-source model of the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideo)
 + For multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
 + Using the INT8 model slows inference down; this is done so that GPUs with less VRAM can still run inference normally with only a small loss in video quality, at the cost of a significant drop in speed.
 + The 2B model is trained with `FP16` precision and the 5B model with `BF16`. We recommend running inference in the precision the model was trained with.
- + `FP8` precision must be used on `NVIDIA H100` or newer devices, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source; `CUDA 12.4` is recommended.
+ + `FP8` precision must be used on `NVIDIA H100` or newer devices, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate`
+ Python packages from source; `CUDA 12.4` is recommended.
 + The inference speed tests also used the memory optimizations above; without them, inference is about 10% faster. Only the `diffusers` version of the model supports quantization.
 + The model only supports English input; other languages can be translated into English while prompts are polished by a large language model.
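
The precision note in the hunk above is easy to miss in practice; a short illustration of loading each checkpoint in its training precision (illustrative only, not from the commit):

```python
import torch
from diffusers import CogVideoXPipeline

# 2B was trained in FP16, 5B in BF16; load each in its training precision.
pipe_2b = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe_5b = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
```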
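The FP8 bullet above states requirements but shows no code. A hedged sketch of one FP8 route via torchao's `float8_weight_only` (assumed available in a recent torchao source build; H100-class hardware required, per the note):

```python
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, float8_weight_only

# Requires an H100-class GPU plus source builds of torch/torchao,
# as the note above specifies.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(pipe.transformer, float8_weight_only())
```
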
@@ -203,6 +204,63 @@ video = pipe(
 export_to_video(video, "output.mp4", fps=8)
 ```
 
+ ## Quantized Inference
+ 
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, Transformer, and VAE modules to lower CogVideoX's memory requirements, making it possible to run the model on a free-tier T4 Colab or on GPUs with less VRAM! It is also worth noting that TorchAO quantization is fully compatible with `torch.compile`, which can speed up inference significantly.
+ 
+ ```diff
+ # To get started, PytorchAO needs to be installed from GitHub source, along with PyTorch Nightly.
+ # Source and nightly installation is only required until the next release.
+ 
+ import torch
+ from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
+ from diffusers.utils import export_to_video
+ + from transformers import T5EncoderModel
+ + from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight
+ 
+ + quantization = int8_weight_only
+ 
+ + text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
+ + quantize_(text_encoder, quantization())
+ 
+ + transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
+ + quantize_(transformer, quantization())
+ 
+ + vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
+ + quantize_(vae, quantization())
+ 
+ # Create pipeline and run inference
+ pipe = CogVideoXPipeline.from_pretrained(
+     "THUDM/CogVideoX-2b",
+ + text_encoder=text_encoder,
+ + transformer=transformer,
+ + vae=vae,
+     torch_dtype=torch.bfloat16,
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.vae.enable_tiling()
+ 
+ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
+ 
+ video = pipe(
+     prompt=prompt,
+     num_videos_per_prompt=1,
+     num_inference_steps=50,
+     num_frames=49,
+     guidance_scale=6,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ ).frames[0]
+ 
+ export_to_video(video, "output.mp4", fps=8)
+ ```
+ 
+ Additionally, when using PytorchAO these models can be serialized and stored in a quantized data type to save disk space. Examples and benchmarks can be found at the links below.
+ 
+ - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+ - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+ 
 ## Explore the Model
 
 Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
@@ -218,7 +276,8 @@ export_to_video(video, "output.mp4", fps=8)
 
 The CogVideoX-2B model (including its corresponding Transformers and VAE modules) is released under the [Apache 2.0 License](LICENSE).
 
- The CogVideoX-5B model (Transformers module) is released under the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE)
+ The CogVideoX-5B model (Transformers module)
+ is released under the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE)
 license.
 
 ## Citation