metadata

license: apache-2.0
language:
  - en
library_name: diffusers
pipeline_tag: text-to-video

Gallery · GitHub · Blog · Paper · Discord · Join Waitlist (Try it on Discord!)

Gallery

For more demos and corresponding prompts, see the Allegro Gallery.

Key Features

Open Source: Full model weights and code are available to the community under the Apache 2.0 license.
Versatile Content Creation: Generates a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
High-Quality Output: Generates detailed 6-second videos at 15 FPS and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
Small and Efficient: Combines a 175M-parameter VideoVAE with a 2.8B-parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. The context length is 79.2K, equivalent to 88 frames.

Model info

Model	Allegro
Description	Text-to-Video Generation Model
Download	Hugging Face
Parameter	VAE: 175M
Parameter	DiT: 2.8B
Inference Precision	VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)
Inference Precision	DiT/T5: BF16/FP32/TF32
Context Length	79.2K
Resolution	720 x 1280
Frames	88
Video Length	6 seconds @ 15 FPS
Single GPU Memory Usage	9.3 GB in BF16 (with cpu_offload)
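
The context length, resolution, and frame count in the table are related by simple arithmetic. The sketch below reproduces the 79.2K figure under assumptions that this card does not state: that the VideoVAE compresses 4x in time and 8x8 in space, and that the VideoDiT groups latents into 2x2 spatial patches. The function name and parameters are illustrative only.

```python
def context_length(frames, height, width, t_down=4, s_down=8, patch=2):
    """Estimate the DiT sequence length for one video.

    t_down, s_down and patch are ASSUMED compression factors
    (temporal VAE stride, spatial VAE stride, DiT patch size);
    they are not given in this model card.
    """
    latent_frames = frames // t_down                    # 88 // 4 = 22
    tokens_h = height // s_down // patch                # 720 // 8 // 2 = 45
    tokens_w = width // s_down // patch                 # 1280 // 8 // 2 = 80
    return latent_frames * tokens_h * tokens_w


print(context_length(88, 720, 1280))  # 22 * 45 * 80 = 79200
```

Under these assumptions, 88 frames at 720 x 1280 come to 79,200 tokens, which matches the 79.2K listed above.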

Quick start

Download the Allegro GitHub code.
Install the necessary requirements.
- Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4. For details, see requirements.txt.
- It is recommended to use Anaconda to create a new environment (Python >= 3.10) to run the following example.
Download the Allegro model weights. Until the diffusers integration is available, use git lfs or snapshot_download.
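
A minimal sketch of the snapshot_download route, using the `huggingface_hub` package. The repository id and the one-subfolder-per-component layout are assumptions not confirmed by this card, so check them against the model page before relying on them.

```python
from pathlib import Path

# ASSUMPTION: Hub repository id; this card does not spell it out.
REPO_ID = "rhymes-ai/Allegro"


def download_weights(local_dir="./allegro_weights"):
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so the helpers below work without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def component_paths(weights_dir):
    """Map each component to a subfolder of weights_dir.

    ASSUMPTION: the snapshot holds one subfolder per component,
    named after the flags used by single_inference.py.
    """
    root = Path(weights_dir)
    names = ("vae", "transformer", "text_encoder", "tokenizer")
    return {name: str(root / name) for name in names}


# Usage (downloads several GB, so it is left commented out):
# paths = component_paths(download_weights())
```

The returned paths can then be passed to the --vae, --dit, --text_encoder and --tokenizer flags when running inference.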

Run inference.

python single_inference.py \
--user_prompt 'A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats.' \
--save_path ./output_videos/test_video.mp4 \
--vae your/path/to/vae \
--dit your/path/to/transformer \
--text_encoder your/path/to/text_encoder \
--tokenizer your/path/to/tokenizer \
--guidance_scale 7.5 \
--num_sampling_steps 100 \
--seed 42

Add '--enable_cpu_offload' to offload the model to the CPU. This reduces GPU memory use to about 9.3 GB (compared to 27.5 GB without CPU offload), but inference time increases significantly.
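
When scripting several runs, it can help to assemble the command in Python and toggle CPU offloading per run. The sketch below only builds the argument list from the flags shown above. The helper name `build_command` and the `weights` dictionary are hypothetical, and the paths are the same placeholders used in the command.

```python
import shlex


def build_command(prompt, save_path, weights, enable_cpu_offload=False,
                  guidance_scale=7.5, num_sampling_steps=100, seed=42):
    """Return the single_inference.py invocation as an argument list."""
    cmd = [
        "python", "single_inference.py",
        "--user_prompt", prompt,
        "--save_path", save_path,
        "--vae", weights["vae"],
        "--dit", weights["dit"],
        "--text_encoder", weights["text_encoder"],
        "--tokenizer", weights["tokenizer"],
        "--guidance_scale", str(guidance_scale),
        "--num_sampling_steps", str(num_sampling_steps),
        "--seed", str(seed),
    ]
    if enable_cpu_offload:
        # Lower GPU memory use at the cost of slower inference.
        cmd.append("--enable_cpu_offload")
    return cmd


# Placeholder paths, as in the command above.
weights = {
    "vae": "your/path/to/vae",
    "dit": "your/path/to/transformer",
    "text_encoder": "your/path/to/text_encoder",
    "tokenizer": "your/path/to/tokenizer",
}
cmd = build_command("A seaside harbor with bright sunlight.",
                    "./output_videos/test_video.mp4",
                    weights, enable_cpu_offload=True)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with long prompts that contain commas or apostrophes.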

(Optional) Interpolate the video to 30 FPS.

It is recommended to use EMA-VFI to interpolate the video from 15 FPS to 30 FPS.

For better visual quality, please use imageio to save the video.
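
A minimal sketch of saving frames with imageio, assuming the frames are a sequence of HxWx3 uint8 arrays and that the `imageio-ffmpeg` backend is installed. The helper name `save_video` and the `quality=8` default are illustrative choices, not settings documented in this card.

```python
def save_video(frames, save_path, fps=15, quality=8):
    """Write frames to a video file with imageio's ffmpeg backend.

    frames: sequence of HxWx3 uint8 arrays (ASSUMED layout).
    quality: ffmpeg plugin setting from 0 (lowest) to 10 (highest);
             8 is an illustrative value.
    """
    # Imported lazily; requires `pip install imageio imageio-ffmpeg`.
    import imageio

    imageio.mimwrite(save_path, frames, fps=fps, quality=quality)


# Usage, with `frames` coming from the generation step:
# save_video(frames, "./output_videos/test_video.mp4", fps=15)
```

The default of 15 FPS matches the model's native frame rate. For a clip interpolated with EMA-VFI, pass fps=30 instead.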

License

This repo is released under the Apache 2.0 License.