Text-to-Video · Diffusers · Safetensors · I2VGenXLPipeline · image-to-video

vilab_pullrequest

#3 by jensinjames - opened
README.md CHANGED
@@ -1,8 +1,5 @@
  ---
  license: mit
- tags:
- - image-to-video
- pipeline_tag: text-to-video
  ---
  # VGen

@@ -238,47 +235,7 @@ In preparation.

  Our codebase essentially supports all the commonly used components in video generation. You can manage your experiments flexibly by adding corresponding registration classes, including `ENGINE, MODEL, DATASETS, EMBEDDER, AUTO_ENCODER, DISTRIBUTION, VISUAL, DIFFUSION, PRETRAIN`, and can be compatible with all our open-source algorithms according to your own needs. If you have any questions, feel free to give us your feedback at any time.

- ## Integration of I2VGenXL with 🧨 diffusers
-
- I2VGenXL is supported in the 🧨 diffusers library. Here's how to use it:
-
- ```python
- import torch
- from diffusers import I2VGenXLPipeline
- from diffusers.utils import load_image, export_to_gif
-
- repo_id = "ali-vilab/i2vgen-xl"
- pipeline = I2VGenXLPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16").to("cuda")
-
- image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?download=true"
- image = load_image(image_url).convert("RGB")
- prompt = "Papers were floating in the air on a table in the library"
-
- generator = torch.manual_seed(8888)
- frames = pipeline(
-     prompt=prompt,
-     image=image,
-     generator=generator
- ).frames[0]
-
- print(export_to_gif(frames))
- ```
-
- Find the official documentation [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/i2vgenxl).
-
- Sample output with I2VGenXL:
-
- <table>
-     <tr>
-         <td><center>
-         masterpiece, bestquality, sunset.
-         <br>
-         <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
-             alt="library"
-             style="width: 300px;" />
-         </center></td>
-     </tr>
- </table>

  ## BibTeX

@@ -326,4 +283,4 @@ If this repo is useful to you, please cite our corresponding technical paper.

  ## Disclaimer

- This open-source model is trained with using [WebVid-10M](https://m-bain.github.io/webvid-dataset/) and [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) datasets and is intended for <strong>RESEARCH/NON-COMMERCIAL USE ONLY</strong>.
+ This open-source model is trained with using [WebVid-10M](https://m-bain.github.io/webvid-dataset/) and [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) datasets and is intended for <strong>RESEARCH/NON-COMMERCIAL USE ONLY</strong>.
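The `Customize your own approach` paragraph kept above describes wiring new components into VGen through registration classes (`ENGINE`, `MODEL`, `DATASETS`, and so on). The repository's actual registry helpers are not part of this diff, so the snippet below is only a minimal sketch of that registry pattern under assumed names (`Registry`, `MODEL.register_class`, `MyTemporalUNet`); it is not VGen's real API.

```python
# Hypothetical sketch of the registry pattern described above; class and
# registry names are illustrative, not VGen's actual API.
class Registry:
    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register_class(self, cls):
        # Used as a decorator: store the class under its own name.
        self._classes[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # cfg is a dict such as {"type": "MyTemporalUNet", "in_channels": 4}.
        params = dict(cfg)
        cls = self._classes[params.pop("type")]
        return cls(**params)


MODEL = Registry("MODEL")  # one registry per component kind (ENGINE, DATASETS, ...)


@MODEL.register_class
class MyTemporalUNet:
    def __init__(self, in_channels=4):
        self.in_channels = in_channels


model = MODEL.build({"type": "MyTemporalUNet", "in_channels": 4})
```

Each component kind would get its own registry, and the YAML configs then reference registered classes by name.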
README_diffusers.md DELETED
@@ -1,334 +0,0 @@
- ---
- license: mit
- library_name: diffusers
- tags:
- - image-to-video
- pipeline_tag: text-to-video
- ---
- # VGen
-
-
- ![figure1](source/VGen.jpg "figure1")
-
- VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. This repository includes implementations of the following methods:
-
-
- [I2VGen-xl: High-quality image-to-video synthesis via cascaded diffusion models](https://i2vgen-xl.github.io/)
- [VideoComposer: Compositional Video Synthesis with Motion Controllability](https://videocomposer.github.io/)
- [Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation](https://higen-t2v.github.io/)
- [A Recipe for Scaling up Text-to-Video Generation with Text-free Videos]()
- [InstructVideo: Instructing Video Diffusion Models with Human Feedback]()
- [DreamVideo: Composing Your Dream Videos with Customized Subject and Motion](https://dreamvideo-t2v.github.io/)
- [VideoLCM: Video Latent Consistency Model](https://arxiv.org/abs/2312.09109)
- [Modelscope text-to-video technical report](https://arxiv.org/abs/2308.06571)
-
-
- VGen can produce high-quality videos from the input text, images, desired motion, desired subjects, and even the feedback signals provided. It also offers a variety of commonly used video generation tools such as visualization, sampling, training, inference, joint training using images and videos, acceleration, and more.
-
-
- <a href='https://i2vgen-xl.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2311.04145'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/XUi0y7dxqEQ) <a href='https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441039979087.mp4'><img src='source/logo.png'></a>
-
-
- ## 🔥News!!!
- __[2024.01]__ Diffusers now supports I2VGenXL
- __[2023.12]__ We release the high-efficiency video generation method [VideoLCM](https://arxiv.org/abs/2312.09109)
- __[2023.12]__ We release the code and model of I2VGen-XL and the ModelScope T2V
- __[2023.12]__ We release the T2V method [HiGen](https://higen-t2v.github.io) and the T2V customization method [DreamVideo](https://dreamvideo-t2v.github.io).
- __[2023.12]__ We write an [introduction document](doc/introduction.pdf) for VGen and compare I2VGen-XL with SVD.
- __[2023.11]__ We release a high-quality I2VGen-XL model; please refer to the [Webpage](https://i2vgen-xl.github.io)
-
-
- ## TODO
- [x] Release the technical papers and webpage of [I2VGen-XL](doc/i2vgen-xl.md)
- [x] Release the code and pretrained models that can generate 1280x720 videos
- [ ] Release models optimized specifically for the human body and faces
- [ ] Updated version that can fully maintain identity and capture large and accurate motions simultaneously
- [ ] Release other methods and the corresponding models
-
-
- ## Preparation
-
- The main features of VGen are as follows:
- - Expandability, allowing for easy management of your own experiments.
- - Completeness, encompassing all common components for video generation.
- - Excellent performance, featuring powerful pre-trained models in multiple tasks.
-
-
- ### Installation
-
- ```
- conda create -n vgen python=3.8
- conda activate vgen
- pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
- pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
- ```
-
- ### Datasets
-
- We have provided a **demo dataset** that includes images and videos, along with their lists in ``data``.
-
- *Please note that the demo images used here are for testing purposes and were not included in the training.*
-
-
- ### Clone the codebase
-
- ```
- git clone https://github.com/damo-vilab/i2vgen-xl.git
- cd i2vgen-xl
- ```
-
-
- ## Getting Started with VGen
-
- ### (1) Train your text-to-video model
-
-
- Enabling distributed training is as easy as executing the following command.
- ```
- python train_net.py --cfg configs/t2v_train.yaml
- ```
-
- In the `t2v_train.yaml` configuration file, you can specify the data, adjust the video-to-image ratio using `frame_lens`, validate your ideas with different diffusion settings, and so on.
-
- Before training, you can download any of our open-source models for initialization. Our codebase supports custom initialization and `grad_scale` settings, all of which are included in the `Pretrain` item in the yaml file.
- During training, you can view the saved models and intermediate inference results in the `workspace/experiments/t2v_train` directory.
-
- After the training is completed, you can perform inference on the model using the following command.
- ```
- python inference.py --cfg configs/t2v_infer.yaml
- ```
- Then you can find the videos you generated in the `workspace/experiments/test_img_01` directory. For specific configurations such as data, models, seed, etc., please refer to the `t2v_infer.yaml` file.
-
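The training and inference steps above are driven entirely by the YAML configs. The exact schema of `configs/t2v_train.yaml` is not shown in this PR, so the snippet below is only a hypothetical sketch of overriding the options the text mentions (`frame_lens`, the `Pretrain` item, `grad_scale`); the key layout and values are assumptions, not taken from the repository.

```python
# Hypothetical sketch: copy configs/t2v_train.yaml with a few overrides before
# launching train_net.py. Key names come from the prose above; their exact
# nesting and value types are assumed.
import yaml  # requires PyYAML

with open("configs/t2v_train.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["frame_lens"] = [1, 16]  # assumed: per-batch video-to-image ratio control
cfg["Pretrain"] = {          # assumed: custom initialization settings
    "pretrained_checkpoint": "models/your_downloaded_model.pth",
    "grad_scale": 0.5,
}

with open("configs/t2v_train_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

# then: python train_net.py --cfg configs/t2v_train_custom.yaml
```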
- <!-- <table>
-   <center>
-   <tr>
-   <td ><center>
-     <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441754174077.mp4"></video>
-   </center></td>
-   <td ><center>
-     <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441138824052.mp4"></video>
-   </center></td>
-   </tr>
-   </center>
- </table>
- </center> -->
-
- <table>
-   <center>
-   <tr>
-   <td ><center>
-     <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01Ya2I5I25utrJwJ9Jf_!!6000000007587-2-tps-1280-720.png"></image>
-   </center></td>
-   <td ><center>
-     <image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01CrmYaz1zXBetmg3dd_!!6000000006723-2-tps-1280-720.png"></image>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441754174077.mp4">HERE</a> to view the generated video.</p>
-   </center></td>
-   <td ><center>
-     <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441138824052.mp4">HERE</a> to view the generated video.</p>
-   </center></td>
-   </tr>
-   </center>
- </table>
- </center>
-
-
- ### (2) Run the I2VGen-XL model
-
- (i) Download model and test data:
- ```
- !pip install modelscope
- from modelscope.hub.snapshot_download import snapshot_download
- model_dir = snapshot_download('damo/I2VGen-XL', cache_dir='models/', revision='v1.0.0')
- ```
-
- (ii) Run the following command:
- ```
- python inference.py --cfg configs/i2vgen_xl_infer.yaml
- ```
- In a few minutes, you can retrieve the high-definition video you wish to create from the `workspace/experiments/test_img_01` directory. At present, we find that the current model performs inadequately on **anime images** and **images with a black background** due to the lack of relevant training data. We are consistently working to optimize it.
-
-
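The download block above mixes a notebook shell command (`!pip install modelscope`) with Python. Run the install in a shell first, then the Python part; the repo id, cache directory, and revision below are taken verbatim from that block.

```python
# Shell step first: pip install modelscope
from modelscope.hub.snapshot_download import snapshot_download

# Fetch the I2VGen-XL checkpoints used by configs/i2vgen_xl_infer.yaml
model_dir = snapshot_download('damo/I2VGen-XL', cache_dir='models/', revision='v1.0.0')
print(model_dir)
```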
- <span style="color:red">Due to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.</span>
-
- <center>
- <table>
-   <center>
-   <tr>
-   <td ><center>
-     <image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01CCEq7K1ZeLpNQqrWu_!!6000000003219-0-tps-1280-720.jpg"></image>
-   </center></td>
-   <td ><center>
-     <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4"></video> -->
-     <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01hIQcvG1spmQMLqBo0_!!6000000005816-1-tps-1280-704.gif"></image>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <p>Input Image</p>
-   </center></td>
-   <td ><center>
-     <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4">HERE</a> to view the generated video.</p>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01ZXY7UN23K8q4oQ3uG_!!6000000007236-2-tps-1280-720.png"></image>
-   </center></td>
-   <td ><center>
-     <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4"></video> -->
-     <image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01iaSiiv1aJZURUEY53_!!6000000003309-1-tps-1280-704.gif"></image>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <p>Input Image</p>
-   </center></td>
-   <td ><center>
-     <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4">HERE</a> to view the generated video.</p>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01NHpVGl1oat4H54Hjf_!!6000000005242-2-tps-1280-720.png"></image>
-   </center></td>
-   <td ><center>
-     <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4"></video> -->
-     <!-- <image muted="true" height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif"></image>
-     -->
-     <image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif"></image>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <p>Input Image</p>
-   </center></td>
-   <td ><center>
-     <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4">HERE</a> to view the generated video.</p>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01odS61s1WW9tXen21S_!!6000000002795-0-tps-1280-720.jpg"></image>
-   </center></td>
-   <td ><center>
-     <!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4"></video> -->
-     <image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01Jyk1HT28JkZtpAtY6_!!6000000007912-1-tps-1280-704.gif"></image>
-   </center></td>
-   </tr>
-   <tr>
-   <td ><center>
-     <p>Input Image</p>
-   </center></td>
-   <td ><center>
-     <p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4">HERE</a> to view the generated video.</p>
-   </center></td>
-   </tr>
-   </center>
- </table>
- </center>
-
- ### (3) Other methods
-
- In preparation.
-
-
- ## Customize your own approach
-
- Our codebase essentially supports all the commonly used components in video generation. You can manage your experiments flexibly by adding corresponding registration classes, including `ENGINE, MODEL, DATASETS, EMBEDDER, AUTO_ENCODER, DISTRIBUTION, VISUAL, DIFFUSION, PRETRAIN`, and can be compatible with all our open-source algorithms according to your own needs. If you have any questions, feel free to give us your feedback at any time.
-
- ## Integration of I2VGenXL with 🧨 diffusers
-
- I2VGenXL is supported in the 🧨 diffusers library. Here's how to use it:
-
- ```python
- import torch
- from diffusers import I2VGenXLPipeline
- from diffusers.utils import load_image, export_to_gif
-
- repo_id = "ali-vilab/i2vgen-xl"
- pipeline = I2VGenXLPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16").to("cuda")
-
- image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?download=true"
- image = load_image(image_url).convert("RGB")
- prompt = "Papers were floating in the air on a table in the library"
-
- generator = torch.manual_seed(8888)
- frames = pipeline(
-     prompt=prompt,
-     image=image,
-     generator=generator
- ).frames[0]
-
- print(export_to_gif(frames))
- ```
-
- Find the official documentation [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/i2vgenxl).
-
- Sample output with I2VGenXL:
-
- <table>
-     <tr>
-         <td><center>
-         masterpiece, bestquality, sunset.
-         <br>
-         <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
-             alt="library"
-             style="width: 300px;" />
-         </center></td>
-     </tr>
- </table>
-
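As a follow-up to the example above, `I2VGenXLPipeline` also exposes the usual diffusers sampling controls, and `export_to_video` can be used instead of a GIF. A minimal sketch, reusing the `pipeline`, `image`, and `prompt` objects from the example; the negative prompt, step count, and output file name are illustrative choices, not values from this repository.

```python
import torch
from diffusers.utils import export_to_video

negative_prompt = "Distorted, discontinuous, ugly, blurry, low resolution, motionless, static"
generator = torch.manual_seed(8888)

frames = pipeline(
    prompt=prompt,
    image=image,
    negative_prompt=negative_prompt,   # steer sampling away from artifacts
    num_inference_steps=50,            # number of denoising steps
    guidance_scale=9.0,                # classifier-free guidance strength
    generator=generator,
).frames[0]

export_to_video(frames, "i2vgen_xl_sample.mp4", fps=8)
```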
- ## BibTeX
-
- If this repo is useful to you, please cite our corresponding technical paper.
-
-
- ```bibtex
- @article{2023i2vgenxl,
-   title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
-   author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
-   journal={arXiv preprint arXiv:2311.04145},
-   year={2023}
- }
- @article{2023videocomposer,
-   title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
-   author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
-   journal={arXiv preprint arXiv:2306.02018},
-   year={2023}
- }
- @article{wang2023modelscope,
-   title={Modelscope text-to-video technical report},
-   author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
-   journal={arXiv preprint arXiv:2308.06571},
-   year={2023}
- }
- @article{dreamvideo,
-   title={DreamVideo: Composing Your Dream Videos with Customized Subject and Motion},
-   author={Wei, Yujie and Zhang, Shiwei and Qing, Zhiwu and Yuan, Hangjie and Liu, Zhiheng and Liu, Yu and Zhang, Yingya and Zhou, Jingren and Shan, Hongming},
-   journal={arXiv preprint arXiv:2312.04433},
-   year={2023}
- }
- @article{qing2023higen,
-   title={Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation},
-   author={Qing, Zhiwu and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Wei, Yujie and Zhang, Yingya and Gao, Changxin and Sang, Nong},
-   journal={arXiv preprint arXiv:2312.04483},
-   year={2023}
- }
- @article{wang2023videolcm,
-   title={VideoLCM: Video Latent Consistency Model},
-   author={Wang, Xiang and Zhang, Shiwei and Zhang, Han and Liu, Yu and Zhang, Yingya and Gao, Changxin and Sang, Nong},
-   journal={arXiv preprint arXiv:2312.09109},
-   year={2023}
- }
- ```
-
- ## Disclaimer
-
- This open-source model is trained with using [WebVid-10M](https://m-bain.github.io/webvid-dataset/) and [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) datasets and is intended for <strong>RESEARCH/NON-COMMERCIAL USE ONLY</strong>.
feature_extractor/preprocessor_config.json DELETED
@@ -1,27 +0,0 @@
- {
-   "crop_size": {
-     "height": 224,
-     "width": 224
-   },
-   "do_center_crop": true,
-   "do_convert_rgb": true,
-   "do_normalize": true,
-   "do_rescale": true,
-   "do_resize": true,
-   "image_mean": [
-     0.48145466,
-     0.4578275,
-     0.40821073
-   ],
-   "image_processor_type": "CLIPImageProcessor",
-   "image_std": [
-     0.26862954,
-     0.26130258,
-     0.27577711
-   ],
-   "resample": 3,
-   "rescale_factor": 0.00392156862745098,
-   "size": {
-     "shortest_edge": 224
-   }
- }
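The deleted file above is a standard CLIP image-processor config (224×224 center crop with CLIP normalization). Assuming the diffusers-format layout is still available at the revision you point to (this PR removes it from the repo), the processor can be loaded on its own roughly as follows:

```python
from transformers import CLIPImageProcessor

# Load only the image preprocessor from the pipeline layout shown above.
feature_extractor = CLIPImageProcessor.from_pretrained(
    "ali-vilab/i2vgen-xl", subfolder="feature_extractor"
)
print(feature_extractor.crop_size)  # {'height': 224, 'width': 224}
```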
image_encoder/config.json DELETED
@@ -1,23 +0,0 @@
- {
-   "_name_or_path": "i2vgen-xl/image_encoder",
-   "architectures": [
-     "CLIPVisionModelWithProjection"
-   ],
-   "attention_dropout": 0.0,
-   "dropout": 0.0,
-   "hidden_act": "gelu",
-   "hidden_size": 1280,
-   "image_size": 224,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 5120,
-   "layer_norm_eps": 1e-05,
-   "model_type": "clip_vision_model",
-   "num_attention_heads": 16,
-   "num_channels": 3,
-   "num_hidden_layers": 32,
-   "patch_size": 14,
-   "projection_dim": 1024,
-   "torch_dtype": "float16",
-   "transformers_version": "4.36.2"
- }
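This config describes the CLIP vision tower used to embed the conditioning image (hidden size 1280, projection dim 1024). Assuming the layout is still present at the revision used, it can be loaded standalone roughly like this:

```python
import torch
from transformers import CLIPVisionModelWithProjection

# Load only the CLIP image encoder from the layout shown above.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "ali-vilab/i2vgen-xl", subfolder="image_encoder", torch_dtype=torch.float16
)
print(image_encoder.config.projection_dim)  # 1024
```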
image_encoder/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ae616c24393dd1854372b0639e5541666f7521cbe219669255e865cb7f89466a
- size 1264217240
image_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ed1e5af7b4042ca30ec29999a4a5cfcac90b7fb610fd05ace834f2dcbb763eab
- size 2528371296
model_index.json DELETED
@@ -1,33 +0,0 @@
- {
-   "_class_name": "I2VGenXLPipeline",
-   "_diffusers_version": "0.26.1",
-   "_name_or_path": "i2vgen-xl",
-   "feature_extractor": [
-     "transformers",
-     "CLIPImageProcessor"
-   ],
-   "image_encoder": [
-     "transformers",
-     "CLIPVisionModelWithProjection"
-   ],
-   "scheduler": [
-     "diffusers",
-     "DDIMScheduler"
-   ],
-   "text_encoder": [
-     "transformers",
-     "CLIPTextModel"
-   ],
-   "tokenizer": [
-     "transformers",
-     "CLIPTokenizer"
-   ],
-   "unet": [
-     "diffusers",
-     "I2VGenXLUNet"
-   ],
-   "vae": [
-     "diffusers",
-     "AutoencoderKL"
-   ]
- }
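`model_index.json` is the pipeline manifest: it maps each sub-folder to the library and class that `from_pretrained` should instantiate. One practical consequence is that individual components can be overridden at load time; a minimal sketch, assuming the files remain available at the revision being loaded:

```python
import torch
from diffusers import DDIMScheduler, I2VGenXLPipeline

repo_id = "ali-vilab/i2vgen-xl"

# from_pretrained reads model_index.json to build each component; passing a
# component explicitly (here the scheduler) overrides the default.
scheduler = DDIMScheduler.from_pretrained(repo_id, subfolder="scheduler")
pipeline = I2VGenXLPipeline.from_pretrained(
    repo_id, scheduler=scheduler, torch_dtype=torch.float16, variant="fp16"
)
```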
scheduler/scheduler_config.json DELETED
@@ -1,19 +0,0 @@
- {
-   "_class_name": "DDIMScheduler",
-   "_diffusers_version": "0.26.1",
-   "beta_end": 0.02,
-   "beta_schedule": "squaredcos_cap_v2",
-   "beta_start": 0.0001,
-   "clip_sample": false,
-   "clip_sample_range": 1.0,
-   "dynamic_thresholding_ratio": 0.995,
-   "num_train_timesteps": 1000,
-   "prediction_type": "v_prediction",
-   "rescale_betas_zero_snr": true,
-   "sample_max_value": 1.0,
-   "set_alpha_to_one": true,
-   "steps_offset": 1,
-   "thresholding": false,
-   "timestep_spacing": "leading",
-   "trained_betas": null
- }
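The scheduler is a DDIM scheduler configured for v-prediction with zero-terminal-SNR rescaling. For reference, the same object can be rebuilt offline from the values listed above with `DDIMScheduler.from_config`; a minimal sketch:

```python
from diffusers import DDIMScheduler

# Rebuild the scheduler offline from the config values shown above.
scheduler = DDIMScheduler.from_config({
    "beta_start": 0.0001,
    "beta_end": 0.02,
    "beta_schedule": "squaredcos_cap_v2",
    "num_train_timesteps": 1000,
    "prediction_type": "v_prediction",
    "rescale_betas_zero_snr": True,
    "set_alpha_to_one": True,
    "steps_offset": 1,
    "timestep_spacing": "leading",
    "clip_sample": False,
})
print(scheduler.config.prediction_type)  # v_prediction
```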
text_encoder/config.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_name_or_path": "i2vgen-xl/text_encoder",
-   "architectures": [
-     "CLIPTextModel"
-   ],
-   "attention_dropout": 0.0,
-   "bos_token_id": 0,
-   "dropout": 0.0,
-   "eos_token_id": 2,
-   "hidden_act": "gelu",
-   "hidden_size": 1024,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 4096,
-   "layer_norm_eps": 1e-05,
-   "max_position_embeddings": 77,
-   "model_type": "clip_text_model",
-   "num_attention_heads": 16,
-   "num_hidden_layers": 24,
-   "pad_token_id": 1,
-   "projection_dim": 1024,
-   "torch_dtype": "float16",
-   "transformers_version": "4.36.2",
-   "vocab_size": 49408
- }
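The text encoder is a 24-layer CLIP text model (hidden size 1024) paired with the CLIP tokenizer in `tokenizer/`. A minimal sketch of encoding a prompt with just these two components, assuming the layout is available at the revision used:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

repo_id = "ali-vilab/i2vgen-xl"
tokenizer = CLIPTokenizer.from_pretrained(repo_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo_id, subfolder="text_encoder")

# Tokenize to the fixed CLIP context length (77) and encode the prompt.
tokens = tokenizer(
    "Papers were floating in the air on a table in the library",
    padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt",
)
with torch.no_grad():
    prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state
print(prompt_embeds.shape)  # torch.Size([1, 77, 1024])
```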
text_encoder/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:0eb4cf70d7768f4f61bb25e98b2ddaf545f6b33ecc2d7cc3eaa4670a09722fd2
- size 706014768
text_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:bd94a7ea6922e8028227567fe14e04d2989eec31c482e0813e9006afea6637f1
- size 1411983168
tokenizer/merges.txt DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json DELETED
@@ -1,30 +0,0 @@
- {
-   "bos_token": {
-     "content": "<|startoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "unk_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   }
- }
tokenizer/tokenizer_config.json DELETED
@@ -1,30 +0,0 @@
- {
-   "add_prefix_space": false,
-   "added_tokens_decoder": {
-     "49406": {
-       "content": "<|startoftext|>",
-       "lstrip": false,
-       "normalized": true,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "49407": {
-       "content": "<|endoftext|>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "bos_token": "<|startoftext|>",
-   "clean_up_tokenization_spaces": true,
-   "do_lower_case": true,
-   "eos_token": "<|endoftext|>",
-   "errors": "replace",
-   "model_max_length": 77,
-   "pad_token": "<|endoftext|>",
-   "tokenizer_class": "CLIPTokenizer",
-   "unk_token": "<|endoftext|>"
- }
tokenizer/vocab.json DELETED
The diff for this file is too large to render. See raw diff
 
unet/config.json DELETED
@@ -1,31 +0,0 @@
- {
-   "_class_name": "I2VGenXLUNet",
-   "_diffusers_version": "0.26.1",
-   "_name_or_path": "i2vgen-xl/unet",
-   "attention_head_dim": 64,
-   "block_out_channels": [
-     320,
-     640,
-     1280,
-     1280
-   ],
-   "cross_attention_dim": 1024,
-   "down_block_types": [
-     "CrossAttnDownBlock3D",
-     "CrossAttnDownBlock3D",
-     "CrossAttnDownBlock3D",
-     "DownBlock3D"
-   ],
-   "in_channels": 4,
-   "layers_per_block": 2,
-   "norm_num_groups": 32,
-   "num_attention_heads": 64,
-   "out_channels": 4,
-   "sample_size": 32,
-   "up_block_types": [
-     "UpBlock3D",
-     "CrossAttnUpBlock3D",
-     "CrossAttnUpBlock3D",
-     "CrossAttnUpBlock3D"
-   ]
- }
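This is the video (3D) UNet at the core of the pipeline (4 latent channels, cross-attention dimension 1024). Assuming a diffusers version that exports `I2VGenXLUNet` (0.26+) and that the files are available at the revision used, it can be loaded on its own roughly like this:

```python
import torch
from diffusers import I2VGenXLUNet

# Load only the denoising UNet from the layout shown above.
unet = I2VGenXLUNet.from_pretrained(
    "ali-vilab/i2vgen-xl", subfolder="unet", torch_dtype=torch.float16, variant="fp16"
)
print(f"{sum(p.numel() for p in unet.parameters()) / 1e9:.2f}B parameters")
```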
unet/diffusion_pytorch_model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:6ef043141380ab4f5f0698d0b735e4020d6049819e3c756ddbe2672a48a466d4
- size 2841124432
unet/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:3e7c73b3ff159580a1a6535b1ccb473b09a2f40853c03f5d546db70632456ab8
- size 5682063336
vae/config.json DELETED
@@ -1,32 +0,0 @@
- {
-   "_class_name": "AutoencoderKL",
-   "_diffusers_version": "0.26.1",
-   "_name_or_path": "i2vgen-xl/vae",
-   "act_fn": "silu",
-   "block_out_channels": [
-     128,
-     256,
-     512,
-     512
-   ],
-   "down_block_types": [
-     "DownEncoderBlock2D",
-     "DownEncoderBlock2D",
-     "DownEncoderBlock2D",
-     "DownEncoderBlock2D"
-   ],
-   "force_upcast": true,
-   "in_channels": 3,
-   "latent_channels": 4,
-   "layers_per_block": 2,
-   "norm_num_groups": 32,
-   "out_channels": 3,
-   "sample_size": 768,
-   "scaling_factor": 0.18125,
-   "up_block_types": [
-     "UpDecoderBlock2D",
-     "UpDecoderBlock2D",
-     "UpDecoderBlock2D",
-     "UpDecoderBlock2D"
-   ]
- }
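The VAE is a standard `AutoencoderKL`, but note the non-default `scaling_factor` of 0.18125 used when moving between pixel and latent space. Assuming the files are available at the revision used, it loads like any other diffusers VAE:

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE; latents are scaled by config.scaling_factor (0.18125).
vae = AutoencoderKL.from_pretrained(
    "ali-vilab/i2vgen-xl", subfolder="vae", torch_dtype=torch.float16
)
print(vae.config.scaling_factor)  # 0.18125
```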
vae/diffusion_pytorch_model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:3e4c08995484ee61270175e9e7a072b66a6e4eeb5f0c266667fe1f45b90daf9a
- size 167335342
vae/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:2aa1f43011b553a4cba7f37456465cdbd48aab7b54b9348b890e8058ea7683ec
- size 334643268