---
license: apache-2.0
language:
- en
library_name: diffusers
---
<p align="center">
<img src="https://huggingface.co/rhymes-ai/Allegro/resolve/main/Rgif.gif" width="500" height="400"/>
</p>
<p align="center">
<a href="https://rhymes.ai/" target="_blank"> Gallery</a> · <a href="https://github.com/rhymes-ai/Aria" target="_blank">GitHub</a> · <a href="https://www.rhymes.ai/blog-details/" target="_blank">Blog</a> · <a href="https://arxiv.org/pdf/2410.05993" target="_blank">Paper</a> · <a href="https://discord" target="_blank">Discord</a>
</p>
# Gallery
<img src="https://huggingface.co/rhymes-ai/Allegro/resolve/main/gallery.gif" width="1000" height="800"/>

For more demos and corresponding prompts, see the [Allegro Gallery](TBD).
# Key Features
- **High-Quality Output**: Generates detailed 6-second videos at 15 FPS and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
- **Small and Efficient**: Uses a 175M-parameter VAE and a 2.8B-parameter DiT model. Supports multiple precisions (FP32, BF16, FP16) and needs 9.3 GB of GPU memory in BF16 mode with CPU offloading.
- **Extensive Context Length**: Handles up to 79.2k tokens for rich, comprehensive text-to-video generation.
- **Versatile Content Creation**: Capable of generating a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
# Model info
<table>
<tr>
<th>Model</th>
<td>Allegro</td>
</tr>
<tr>
<th>Description</th>
<td>Text-to-Video Generation Model</td>
</tr>
<tr>
<th>Download</th>
<td>HF link - TBD</td>
</tr>
<tr>
<th rowspan="2">Parameters</th>
<td>VAE: 175M</td>
</tr>
<tr>
<td>DiT: 2.8B</td>
</tr>
<tr>
<th rowspan="2">Inference Precision</th>
<td>VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td>
</tr>
<tr>
<td>DiT/T5: BF16/FP32/TF32</td>
</tr>
<tr>
<th>Context Length</th>
<td>79.2k</td>
</tr>
<tr>
<th>Resolution</th>
<td>720 x 1280</td>
</tr>
<tr>
<th>Frames</th>
<td>88</td>
</tr>
<tr>
<th>Video Length</th>
<td>6 seconds @ 15 fps</td>
</tr>
<tr>
<th>Single GPU Memory Usage</th>
<td>9.3 GB in BF16 (with cpu_offload)</td>
</tr>
</table>
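
The figures in the table are consistent with one another, and a little arithmetic makes the relationship visible. The sketch below derives the clip duration from the frame count and frame rate, and shows one way the 79.2k context length can be reached. The compression factors used here (4x temporal and 8x8 spatial in the VAE, followed by 2x2 spatial patches in the DiT) are assumptions that this card does not state, so treat the token calculation as illustrative rather than authoritative.

```python
# Values taken from the model info table.
NUM_FRAMES = 88
FPS = 15
HEIGHT, WIDTH = 720, 1280

# Assumed, not stated in this card: VAE compression factors and DiT patch size.
VAE_TEMPORAL, VAE_SPATIAL = 4, 8
PATCH_SIZE = 2


def clip_seconds(num_frames: int, fps: int) -> float:
    """Duration of a clip in seconds."""
    return num_frames / fps


def video_tokens(num_frames: int, height: int, width: int) -> int:
    """Number of DiT tokens for one video under the assumed factors."""
    latent_frames = num_frames // VAE_TEMPORAL
    tokens_h = height // (VAE_SPATIAL * PATCH_SIZE)
    tokens_w = width // (VAE_SPATIAL * PATCH_SIZE)
    return latent_frames * tokens_h * tokens_w


print(f"{clip_seconds(NUM_FRAMES, FPS):.2f} s")  # 5.87 s, rounded to 6 in the table
print(video_tokens(NUM_FRAMES, HEIGHT, WIDTH))   # 79200, i.e. 79.2k
```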
# Quick start
You can quickly get started with Allegro using the Hugging Face Diffusers library.
For more tutorials, see Allegro GitHub (link-tbd).
1. Install the required packages listed in [requirements.txt](https://github.com/rhymes-ai) on the Allegro GitHub.
2. Perform inference on a single GPU.
```python
import imageio
import torch
from diffusers import DiffusionPipeline

# Load the pipeline in BF16; the custom pipeline code is fetched from the model repo.
allegro_pipeline = DiffusionPipeline.from_pretrained(
    "rhymes-ai/Allegro", trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")
# The VAE works best in FP32, so upcast it separately.
allegro_pipeline.vae = allegro_pipeline.vae.to(torch.float32)

prompt = "a video of an astronaut riding a horse on mars"

# Quality-oriented template; the user prompt replaces the {} placeholder.
positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked),
{}
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo,
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality,
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

num_sampling_steps, guidance_scale, seed = 100, 7.5, 42

# Insert the normalized user prompt into the template.
user_prompt = positive_prompt.format(prompt.lower().strip())

out_video = allegro_pipeline(
    user_prompt,
    negative_prompt=negative_prompt,
    num_frames=88,
    height=720,
    width=1280,
    num_inference_steps=num_sampling_steps,
    guidance_scale=guidance_scale,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda:0").manual_seed(seed),
).video[0]

# Save the frames as an MP4 at the model's native 15 FPS
# (MP4 output in imageio requires the imageio-ffmpeg backend).
imageio.mimwrite("test_video.mp4", out_video, fps=15, quality=8)
```
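
The 9.3 GB memory figure in the model info table assumes CPU offloading, which the snippet above does not enable. Below is a minimal sketch of one way to turn it on. It assumes the Allegro remote-code pipeline inherits the standard Diffusers `enable_sequential_cpu_offload()` method (this card does not confirm that) and that the `accelerate` package is installed. The helper name `load_allegro` is hypothetical.

```python
def load_allegro(cpu_offload: bool = True):
    """Hypothetical helper: load Allegro, optionally with CPU offloading."""
    # Imported lazily so the helper can be defined without the GPU stack present.
    import torch
    from diffusers import DiffusionPipeline

    pipeline = DiffusionPipeline.from_pretrained(
        "rhymes-ai/Allegro", trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    # Keep the VAE in FP32, as in the quick start.
    pipeline.vae = pipeline.vae.to(torch.float32)

    if cpu_offload:
        # Assumption: the custom pipeline supports the standard Diffusers hook.
        # Offloading manages device placement itself, so .to("cuda") is skipped.
        pipeline.enable_sequential_cpu_offload()
    else:
        pipeline = pipeline.to("cuda")
    return pipeline
```

Offloading trades speed for memory: submodules are moved to the GPU only when needed, so generation is slower than with the whole pipeline resident on the device.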
Tips:
- We highly recommend using a video frame interpolation model (such as EMA-VFI) to raise the result to 30 FPS.
- For more tutorials, see [Allegro GitHub](https://github.com/rhymes-ai).
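
To see what interpolating to 30 FPS means for the output length, the sketch below counts frames under a simple assumption: the interpolation model inserts one synthesized frame between each adjacent pair of original frames. How EMA-VFI is actually configured is not covered in this card, and the function name is hypothetical.

```python
def interpolated_frame_count(num_frames: int, factor: int = 2) -> int:
    """Frames after inserting (factor - 1) new frames between each adjacent pair."""
    if num_frames < 2:
        return num_frames
    return (num_frames - 1) * factor + 1


frames_30fps = interpolated_frame_count(88)
print(frames_30fps)  # 175 frames, about 5.83 s when played at 30 FPS
```

Under this assumption the clip keeps roughly the same duration while doubling its frame rate, so the interpolated frames would be written with `fps=30` instead of `fps=15`.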
# License
This repo is released under the Apache 2.0 License.