---
license: apache-2.0
pipeline_tag: text-to-video
tags:
- text-to-image
- image-to-video
- flux
---

# ⚡️Pyramid Flow miniFLUX⚡️

[[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[Code 🚀]](https://github.com/jy0205/Pyramid-Flow) [[SD3 Model ⚡️]](https://huggingface.co/rain1011/pyramid-flow-sd3) [[Demo 🤗]](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow)

This is the model repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. Trained only on open-source datasets, it generates high-quality 10-second videos at 768p resolution and 24 FPS, and it naturally supports image-to-video generation.

<table class="center" border="0" style="width: 100%; text-align: left;">
  <tr>
    <th>10s, 768p, 24fps</th>
    <th>5s, 768p, 24fps</th>
    <th>Image-to-video</th>
  </tr>
  <tr>
    <td><video src="https://pyramid-flow.github.io/static/videos/t2v_10s/fireworks.mp4" autoplay muted loop playsinline></video></td>
    <td><video src="https://pyramid-flow.github.io/static/videos/t2v/trailer.mp4" autoplay muted loop playsinline></video></td>
    <td><video src="https://pyramid-flow.github.io/static/videos/i2v/sunday.mp4" autoplay muted loop playsinline></video></td>
  </tr>
</table>

## News

* `2024.10.29` ⚡️⚡️⚡️ We release the [training code](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#training) and [new model checkpoints](https://huggingface.co/rain1011/pyramid-flow-miniflux) with the FLUX structure, trained from scratch.

> We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint and 384p video checkpoint, which are trained with synthetic data from FLUX. We will release the 768p video checkpoint in a few days.
* `2024.10.11` 🤗🤗🤗 The [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!
* `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.

## Installation

We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2, and we are actively working to support a wider range of versions.

```bash
git clone https://github.com/jy0205/Pyramid-Flow
cd Pyramid-Flow

# Create and activate the conda environment
conda create -n pyramid python==3.8.10
conda activate pyramid
pip install -r requirements.txt
```

Then, download the model from [Hugging Face](https://huggingface.co/rain1011) (there are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) and [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3)). The miniFLUX models support 1024p image and 384p video generation, while the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second videos at 24 FPS, while the 768p checkpoint generates up to 10-second videos at 24 FPS.

```python
from huggingface_hub import snapshot_download

model_path = 'PATH'   # The local directory to save the downloaded checkpoint
snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
```
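
If you want the SD3-based variant (for the 768p video checkpoint) instead, the download is identical except for the repository id; the local directory name below is just an illustrative choice:

```python
from huggingface_hub import snapshot_download

sd3_model_path = 'PATH_SD3'   # Illustrative local directory for the SD3 variant
snapshot_download("rain1011/pyramid-flow-sd3", local_dir=sd3_model_path, local_dir_use_symlinks=False, repo_type='model')
```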

## Usage

For inference, we provide a Gradio demo, single-GPU, multi-GPU, and Apple Silicon inference code, as well as VRAM-efficient features such as CPU offloading. Please check our [code repository](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#inference) for usage.

Below is a simplified two-step usage procedure. First, load the downloaded model:

```python
import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (fp16 is not supported yet)

model = PyramidDiTForVideoGeneration(
    'PATH',                                          # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_384p',      # The SD3 checkpoint also supports 'diffusion_transformer_768p'
)

model.vae.enable_tiling()
# model.vae.to("cuda")
# model.dit.to("cuda")
# model.text_encoder.to("cuda")

# If you are not using sequential CPU offloading, uncomment the three .to("cuda") lines above
# and remove the call below.
model.enable_sequential_cpu_offload()
```
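
If your GPU has enough memory to hold the whole model, the comments in the block above describe the alternative layout: move every submodule to the GPU and skip sequential offloading. A minimal sketch, assuming `model` was constructed as in the previous block:

```python
# Alternative to sequential CPU offloading: keep every module on the GPU.
# Assumes `model` was constructed as in the block above; do NOT also call
# model.enable_sequential_cpu_offload() in this configuration.
model.vae.enable_tiling()
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")
```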

Then, you can try text-to-video generation on your own prompts:

```python
prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=384,
        width=640,
        temp=16,                      # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,           # The guidance for the first frame; set it to 7 for the 384p variant
        video_guidance_scale=5.0,     # The guidance for the other video latents
        output_type="pil",
        save_memory=True,             # If you have enough GPU memory, set it to `False` to speed up VAE decoding
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
```

As an autoregressive model, Pyramid Flow also supports (text-conditioned) image-to-video generation:

```python
image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((640, 384))
prompt = "FPV flying over the Great Wall"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate_i2v(
        prompt=prompt,
        input_image=image,
        num_inference_steps=[10, 10, 10],
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,             # If you have enough GPU memory, set it to `False` to speed up VAE decoding
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
```

## Usage tips

* The `guidance_scale` parameter controls the visual quality. We suggest a guidance scale within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
* The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates autoregressive generation degradation, while a smaller value stabilizes the video.
* For 10-second video generation, we recommend a guidance scale of 7 and a video guidance scale of 5; see the sketch after this list.
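
Putting these tips together, a 10-second generation call might look like the sketch below. It reuses `model`, `prompt`, and `torch_dtype` from the text-to-video example; the 768p frame size (`height=768`, `width=1280`) and loading the 768p checkpoint with `model_variant='diffusion_transformer_768p'` are assumptions carried over from the comments above, not values confirmed in this card.

```python
# A sketch combining the usage tips with the earlier text-to-video example.
# Assumptions: the 768p checkpoint is loaded with
# model_variant='diffusion_transformer_768p' and generates 1280x768 frames.
with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=31,                      # temp=31 corresponds to roughly 10 seconds at 24 FPS
        guidance_scale=7.0,           # Recommended first-frame guidance for 10s generation
        video_guidance_scale=5.0,     # Recommended video guidance for 10s generation
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_10s.mp4", fps=24)
```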

## Gallery

The following video examples are generated at 5s, 768p, 24fps. For more results, please visit our [project page](https://pyramid-flow.github.io).

<table class="center" border="0" style="width: 100%; text-align: left;">
  <tr>
    <td><video src="https://pyramid-flow.github.io/static/videos/t2v/tokyo.mp4" autoplay muted loop playsinline></video></td>
    <td><video src="https://pyramid-flow.github.io/static/videos/t2v/eiffel.mp4" autoplay muted loop playsinline></video></td>
  </tr>
  <tr>
    <td><video src="https://pyramid-flow.github.io/static/videos/t2v/waves.mp4" autoplay muted loop playsinline></video></td>
    <td><video src="https://pyramid-flow.github.io/static/videos/t2v/rail.mp4" autoplay muted loop playsinline></video></td>
  </tr>
</table>

## Acknowledgement

We are grateful to the following awesome projects, which we drew on when implementing Pyramid Flow:

* [SD3 Medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium) and [Flux 1.0](https://huggingface.co/black-forest-labs/FLUX.1-dev): State-of-the-art image generation models based on flow matching.
* [Diffusion Forcing](https://boyuan.space/diffusion-forcing) and [GameNGen](https://gamengen.github.io): Next-token prediction meets full-sequence diffusion.
* [WebVid-10M](https://github.com/m-bain/webvid), [OpenVid-1M](https://github.com/NJU-PCALab/OpenVid-1M) and [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): Large-scale datasets for text-to-video generation.
* [CogVideoX](https://github.com/THUDM/CogVideo): An open-source text-to-video generation model that shares many training details.
* [Video-LLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2): An open-source video LLM for our video recaptioning.

## Citation

Consider giving this repository a star and citing Pyramid Flow in your publications if it helps your research.

```
@article{jin2024pyramidal,
  title={Pyramidal Flow Matching for Efficient Video Generative Modeling},
  author={Jin, Yang and Sun, Zhicheng and Li, Ningyuan and Xu, Kun and Xu, Kun and Jiang, Hao and Zhuang, Nan and Huang, Quzhe and Song, Yang and Mu, Yadong and Lin, Zhouchen},
  journal={arXiv preprint arXiv:2410.05954},
  year={2024}
}
```