---
license: apache-2.0
language:
- en
library_name: diffusers
---
<p align="center">
<img src="https://huggingface.co/rhymes-ai/Allegro/resolve/main/banner_white.gif">
</p>
<p align="center">
 <a href="https://rhymes.ai/allegro_gallery" target="_blank">Gallery</a> · <a href="https://github.com/rhymes-ai/Allegro" target="_blank">GitHub</a> · <a href="https://rhymes.ai/blog-details/allegro-advanced-video-generation-model" target="_blank">Blog</a> · <a href="https://arxiv.org/abs/2410.15458" target="_blank">Paper</a> · <a href="https://discord.com/invite/u8HxU23myj" target="_blank">Discord</a> · <a href="https://docs.google.com/forms/d/e/1FAIpQLSfq4Ez48jqZ7ncI7i4GuL7UyCrltfdtrOCDnm_duXxlvh5YmQ/viewform" target="_blank">Join Waitlist</a> (Try it on Discord!)
</p> 

# Gallery
<img src="https://huggingface.co/rhymes-ai/Allegro/resolve/main/gallery.gif" width="1000" height="800"/>

For more demos and corresponding prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).


# Key Features

- **Open Source**: Full [model weights](https://huggingface.co/rhymes-ai/Allegro) and [code](https://github.com/rhymes-ai/Allegro) are available to the community under the Apache 2.0 license!
- **Versatile Content Creation**: Capable of generating a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
- **High-Quality Output**: Generate detailed 6-second videos at 15 FPS with 720x1280 resolution, which can be interpolated to 30 FPS with [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI).
- **Small and Efficient**: Features a 175M parameter VideoVAE and a 2.8B parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. Context length is 79.2K, equivalent to 88 frames.
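As a sanity check on the numbers above, the 79.2K context length is consistent with 88 frames at 720x1280 if one assumes a 4x temporal / 8x spatial VAE compression and a 2x2 patchify in the DiT. These compression factors are typical for video diffusion models but are not stated on this card, so treat them as assumptions:

```python
# Assumed compression factors (not confirmed by this card): 4x temporal
# and 8x spatial in the VideoVAE, then a 2x2 spatial patchify in the DiT.
frames, height, width = 88, 720, 1280
latent_frames = frames // 4            # -> 22
latent_height = height // 8 // 2       # -> 45
latent_width = width // 8 // 2         # -> 80
tokens = latent_frames * latent_height * latent_width
print(tokens)  # 79200, i.e. the 79.2K context length
```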

# Model info 

<table>
  <tr>
    <th>Model</th>
    <td>Allegro</td>
  </tr>
  <tr>
    <th>Description</th>
    <td>Text-to-Video Generation Model</td>
  </tr>
  <tr>
    <th>Download</th>
    <td><a href="https://huggingface.co/rhymes-ai/Allegro">Hugging Face</a></td>
  </tr>
  <tr>
    <th rowspan="2">Parameter</th>
    <td>VAE: 175M</td>
  </tr>
  <tr>
    <td>DiT: 2.8B</td>
  </tr>
  <tr>
    <th rowspan="2">Inference Precision</th>
    <td>VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td>
  </tr>
  <tr>
    <td>DiT/T5: BF16/FP32/TF32</td>
  </tr>
  <tr>
    <th>Context Length</th>
    <td>79.2K</td>
  </tr>
  <tr>
    <th>Resolution</th>
    <td>720 x 1280</td>
  </tr>
  <tr>
    <th>Frames</th>
    <td>88</td>
  </tr>
  <tr>
    <th>Video Length</th>
    <td>6 seconds @ 15 FPS</td>
  </tr>
  <tr>
    <th>Single GPU Memory Usage</th>
    <td>9.3 GB in BF16 (with cpu_offload)</td>
  </tr>
</table>


# Quick start

1. Download the [Allegro GitHub code](https://github.com/rhymes-ai/Allegro).
   
2. Install the necessary requirements.
     
   - Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see [requirements.txt](https://github.com/rhymes-ai/Allegro/blob/main/requirements.txt).  
       
   - It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.  

3. Download the [Allegro model weights](https://huggingface.co/rhymes-ai/Allegro). Until the Diffusers integration is available, download the weights with `git lfs` or `snapshot_download`.
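    A minimal sketch of wiring a downloaded snapshot (e.g. the directory returned by `huggingface_hub.snapshot_download(repo_id="rhymes-ai/Allegro")`) to the component flags used in step 4. The subfolder names `vae`, `transformer`, `text_encoder`, and `tokenizer` are assumptions here; verify them against the actual repository layout on the Hub:

    ```python
    from pathlib import Path

    def weight_paths(root: str) -> dict[str, str]:
        """Map the component flags of single_inference.py to subfolders
        of a downloaded snapshot. Subfolder names are assumptions."""
        base = Path(root)
        return {
            "vae": str(base / "vae"),
            "dit": str(base / "transformer"),
            "text_encoder": str(base / "text_encoder"),
            "tokenizer": str(base / "tokenizer"),
        }

    # e.g. root = snapshot_download(repo_id="rhymes-ai/Allegro"), after
    # `from huggingface_hub import snapshot_download`
    print(weight_paths("./Allegro"))
    ```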
 
4. Run inference.
   
    ```bash
    python single_inference.py \
    --user_prompt 'A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats.' \
    --save_path ./output_videos/test_video.mp4 \
    --vae your/path/to/vae \
    --dit your/path/to/transformer \
    --text_encoder your/path/to/text_encoder \
    --tokenizer your/path/to/tokenizer \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --seed 42
    ```
  
    Use `--enable_cpu_offload` to offload the model to the CPU for a lower GPU memory cost (about 9.3 GB, compared to 27.5 GB without offloading), at the price of significantly longer inference time.

5. (Optional) Interpolate the video to 30 FPS.

    It is recommended to use [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI) to interpolate the video from 15 FPS to 30 FPS.
  
    For better visual quality, use `imageio` to save the video.
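    EMA-VFI performs learned, motion-aware interpolation; for intuition only, here is a naive linear-blend sketch of what doubling 15 FPS to 30 FPS means at the frame level. It is not a substitute for EMA-VFI, and the tiny demo clip shape is chosen only to keep the example cheap:

    ```python
    import numpy as np

    def double_fps_linear(frames: np.ndarray) -> np.ndarray:
        """Insert the average of each adjacent pair of frames, turning N
        frames at 15 FPS into 2N-1 frames at ~30 FPS. Real interpolators
        like EMA-VFI estimate motion instead of blending pixels."""
        mids = (frames[:-1].astype(np.float32) + frames[1:].astype(np.float32)) / 2
        out = np.empty((2 * len(frames) - 1, *frames.shape[1:]), dtype=frames.dtype)
        out[0::2] = frames                     # keep the original frames
        out[1::2] = mids.astype(frames.dtype)  # blended in-between frames
        return out

    # Tiny demo clip: 4 frames of 8x8 RGB.
    clip = np.zeros((4, 8, 8, 3), dtype=np.uint8)
    print(double_fps_linear(clip).shape)  # (7, 8, 8, 3)
    ```

    The frames are assumed to be a `(T, H, W, C)` uint8 array; the interpolated result can then be written out with `imageio` at 30 FPS.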

# License
This repo is released under the Apache 2.0 License.