Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.

We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.

Open-Sora Plan

This project aims to create a simple and scalable repo to reproduce Sora (from OpenAI, though we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system. Models trained on Huawei Ascend can also produce video quality comparable to industry standards.

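This project aims to reproduce Sora through the power of the open-source community. It was jointly initiated by the Peking University and Rabbitpre (兔展) AIGC Joint Lab. The current version is still far from the goal and needs continued improvement and rapid iteration, so pull requests are welcome! The code currently supports complete training and inference on a domestic AI computing system (Huawei Ascend), and models trained on Ascend can also produce video quality on par with the industry.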

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.

📣 News

[2024.07.24] 🔥🔥🔥 v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p videos. Check out our latest report.
[2024.05.27] 🎉 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video's capability to annotate long videos.
[2024.04.09] 🤝 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.
[2024.04.07] 🎉🎉🎉 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
[2024.03.01] 🤗 We launched a plan to reproduce Sora, called Open-Sora Plan! You are welcome to watch 👀 this repository for the latest updates.

Report v1.2.0

In May 2024, we launched Open-Sora-Plan v1.1.0, featuring a 2+1D model architecture that could be quickly utilized for exploratory training in text-to-video generation tasks. However, when handling dense visual tokens, the 2+1D architecture could not simultaneously process spatial and temporal dimensions. Therefore, we transitioned to a 3D full attention architecture, which better captures the joint spatial-temporal features. Although this version is experimental, it advances video generation architecture to a new realm, leading us to release it as v1.2.0.

Compared to previous video generation models, Open-Sora-Plan v1.2.0 offers the following improvements:

Better compressed visual representations. We optimized the structure of CausalVideoVAE, which now delivers enhanced performance and higher inference efficiency.
Better video generation architecture. Instead of 2+1D, we use a diffusion model with a 3D full attention architecture, which provides a better understanding of the world.

Open-Source Release

We open-source Open-Sora-Plan to facilitate future development of video generation in the community. Code, data, and models will be made publicly available.

Code: All training scripts and sample scripts.
Model: Both the diffusion model and CausalVideoVAE are available here.

Gallery

93×1280×720 Text-to-Video Generation. The video quality has been compressed for playback on GitHub.

Detailed Technical Report

CausalVideoVAE

Model Structure

The VAE in version 1.2.0 maintains the overall architecture of the previous version but merges the temporal and spatial downsampling layers. In version 1.1.0, we performed spatial downsampling (stride=1,2,2) followed by temporal downsampling (stride=2,1,1). In version 1.2.0, we conduct both spatial and temporal downsampling simultaneously (stride=2,2,2) and perform spatial-temporal upsampling in the decoder (interpolate_factor=2,2,2).

Due to the absence of additional convolutions during downsampling and upsampling, this method more seamlessly inherits the weights from the SD2.1 VAE, leading to improved initialization of our VAE.
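As a small illustration of the change described above, the sketch below multiplies per-axis strides (time, height, width) to show that the two separate v1.1.0 layers and the single merged v1.2.0 layer reduce a block by the same overall factor. The helper name `combined_stride` is hypothetical and is not taken from the repository.

```python
from math import prod

def combined_stride(*stages):
    """Multiply the per-axis strides (T, H, W) of consecutive downsampling layers."""
    return tuple(prod(axis) for axis in zip(*stages))

# v1.1.0: spatial downsampling followed by a separate temporal downsampling.
v1_1_block = combined_stride((1, 2, 2), (2, 1, 1))
# v1.2.0: a single layer that downsamples time and space together.
v1_2_block = combined_stride((2, 2, 2))

print(v1_1_block, v1_2_block)  # (2, 2, 2) (2, 2, 2)
```

The overall reduction is unchanged; what differs is that v1.2.0 reaches it with one layer instead of two.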

Training Details

As with v1.1.0, we initialize from the SD2.1 VAE using tail initialization. We perform the first phase of training on the Kinetics-400 video dataset, then use the EMA weights from this phase to initialize the second phase, which is fine-tuned on high-quality data (collected in v1.1.0). All training is conducted on 25-frame 256×256 videos using one A100 node.

Training stage	Dataset	Training steps
1	K400	200,000
2	collected in v1.1.0	450,000

Evaluation

We evaluated our VAE on the validation sets of two video datasets, WebVid and Panda70M, and compared it with our v1.1.0 VAE, SD2.1 VAE, SVD VAE, CV-VAE, and Open-Sora's VAE. The WebVid validation set contains 5k videos, while the Panda70M validation set has 6k videos. The videos were resized to 256 pixels on the short side, center-cropped to 256×256, and then 33 consecutive frames were extracted. We used PSNR, SSIM, and LPIPS metrics, and measured the encoding speed on an A100 GPU. The specific results are as follows:

WebVid

Model	Compression Ratio	PSNR↑	SSIM↑	LPIPS↓
SD2-1 VAE	1x8x8	30.19	0.8379	0.0568
SVD VAE	1x8x8	31.15	0.8686	0.0547
CV-VAE	4x8x8	30.76	0.8566	0.0803
Open-Sora VAE	4x8x8	31.12	0.8569	0.1003
Open-Sora Plan v1.1	4x8x8	30.26	0.8597	0.0551
Open-Sora Plan v1.2	4x8x8	31.16	0.8694	0.0586

Panda70M

Model	Compression Ratio	PSNR↑	SSIM↑	LPIPS↓
SD2-1 VAE	1x8x8	30.40	0.8894	0.0396
SVD VAE	1x8x8	31.00	0.9058	0.0379
CV-VAE	4x8x8	29.57	0.8795	0.0673
Open-Sora VAE	4x8x8	31.06	0.8969	0.0666
Open-Sora Plan v1.1	4x8x8	29.16	0.8844	0.0481
Open-Sora Plan v1.2	4x8x8	30.49	0.8970	0.0454
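For readers unfamiliar with the PSNR column in the tables above, here is a minimal sketch of the textbook definition. The report does not give the exact evaluation settings (pixel range, per-frame versus per-video averaging), so the 8-bit `max_value` default and the flat pixel lists are assumptions made for illustration only.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel example, not data from the evaluation.
print(psnr([10, 20, 30, 40], [12, 18, 33, 41]))
```

Higher values mean the reconstruction is closer to the input, which is why the column is marked with ↑.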

Encode Time on A100

Input Size	CV-VAE	Open-Sora	Open-Sora Plan v1.1	Open-Sora Plan v1.2
33x256x256	0.186	0.147	0.104	0.102
81x256x256	0.465	0.357	0.243	0.242

Training Text-to-Video Diffusion Model

Model Structure

The most significant change is that we replaced all 2+1D Transformer blocks with 3D full attention blocks. Each video is first processed by a patch embedding layer, which downsamples the spatial dimensions by a factor of 2. The video is then flattened into a one-dimensional sequence across the frame, width, and height dimensions. We replaced T5-XXL with mT5-XXL to enhance multilingual adaptation. Additionally, we incorporated RoPE.
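To give a feel for the sequence lengths that 3D full attention has to handle, the sketch below estimates the token count for a few video sizes. It uses the 4x8x8 compression ratio from the evaluation tables and the spatial patch factor of 2 from the paragraph above. Two points are assumptions not stated in this report: that the causal VAE encodes the first frame on its own, so T frames become (T - 1) / 4 + 1 latent frames, and that the patch embedding leaves the temporal dimension unchanged. The function names are hypothetical.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=8):
    """Latent (T, H, W) after the VAE, under the stated assumptions."""
    # Assumption: causal temporal compression keeps the first frame separate.
    latent_t = (frames - 1) // t_ratio + 1
    return latent_t, height // s_ratio, width // s_ratio

def token_count(frames, height, width, patch=2):
    """Length of the flattened sequence fed to the 3D attention blocks."""
    t, h, w = latent_shape(frames, height, width)
    # Assumption: the patch embedding halves H and W only; time is untouched.
    return t * (h // patch) * (w // patch)

for frames, height, width in [(29, 480, 640), (29, 720, 1280), (93, 720, 1280)]:
    print(f"{frames}x{width}x{height}: {token_count(frames, height, width)} tokens")
```

Under these assumptions a 93×1280×720 clip becomes 86,400 tokens, nine times as many as a 29×640×480 clip.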

Sequence Parallelism

Due to the high computational complexity of 3D full attention, we must allocate a video across 2 GPUs for parallel processing when training with long-duration and high-resolution videos. We can control the number of GPUs used for a video sample by adjusting the batch size on a node. For example, with sp_size=8 and train_sp_batch_size=4, 2 GPUs are used for a single sample. We support sequence parallelism for both training and inference.
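The relation between the two settings can be sketched as follows. The formula is inferred from the example in the paragraph above (sp_size=8 and train_sp_batch_size=4 give 2 GPUs per sample). The helper names are hypothetical, and the even split of the sequence across GPUs is an assumption rather than a description of the actual implementation.

```python
def gpus_per_sample(sp_size, train_sp_batch_size):
    """Number of GPUs in a sequence-parallel group that share one video sample."""
    if sp_size % train_sp_batch_size != 0:
        raise ValueError("sp_size must be divisible by train_sp_batch_size")
    return sp_size // train_sp_batch_size

def tokens_per_gpu(total_tokens, sp_size, train_sp_batch_size):
    """Approximate slice of the sequence held by each GPU (ceiling division)."""
    n = gpus_per_sample(sp_size, train_sp_batch_size)
    return -(-total_tokens // n)

print(gpus_per_sample(8, 4))  # 2, matching the example in the text
print(gpus_per_sample(8, 2))  # 4
```

Lowering train_sp_batch_size spreads each sample over more GPUs, which shortens the per-GPU sequence at the cost of fewer samples per step.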

Training on 93×720p, we report speed on H100.

GPU (sp_size)	Batch size	Enable sp	train_sp_batch_size	Speed	Steps per day
8	8	×	-	100s/step	~850
8	-	√	4	53s/step	~1600
8	-	√	2	27s/step	~3200
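The "per day" column follows directly from the step time. A quick check, using only the step times reported in the table above:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def steps_per_day(seconds_per_step):
    """Training steps completed in 24 hours at a constant step time."""
    return SECONDS_PER_DAY / seconds_per_step

rows = [
    ("sp disabled, batch size 8", 100),
    ("sp enabled, train_sp_batch_size=4", 53),
    ("sp enabled, train_sp_batch_size=2", 27),
]
for label, seconds in rows:
    print(f"{label}: {steps_per_day(seconds):.0f} steps/day")
```

The results (864, 1630, and 3200) agree with the rounded figures in the table.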

Inference on 29×720p and 93×720p, we report speed on H100.

Size	1 GPU	8 GPUs
29×720p	420s/100step	80s/100step
93×720p	3400s/100step	450s/100step

Dynamic training

Deep neural networks are typically trained using batched inputs. For efficient hardware processing, batch shapes are fixed, leading to a fixed data size. This requires either cropping or padding images to a uniform size, both of which have drawbacks: cropping discards information and degrades performance, while padding is computationally inefficient. Generally, there are three methods for training with arbitrary token counts: Patch n' Pack, bucket, and pad-mask.

Patch n' Pack (NaViT): This method bypasses the fixed sequence length limitation by combining tokens from multiple samples into a new sample. It allows variable-resolution images while maintaining aspect ratios by packing multiple samples together, thereby reducing training time and enhancing performance and flexibility. However, it involves significant code modifications and requires re-adaptation when exploring different model architectures in fields where model designs are still unstable.

Bucket (PixArt-alpha, Open-Sora): This method packages data of different resolutions into buckets and samples batches from each bucket to ensure the same resolution within each batch. It requires minimal code modifications to the model, mainly adjusting the data sampling strategy.

Pad-mask (FiT, our v1.0/v1.1): This method sets a maximum resolution and pads all data to this resolution, generating a corresponding mask. Although the approach is straightforward, it is computationally inefficient.
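To make the cost of pad-mask concrete, the sketch below measures what fraction of the positions in a padded batch is padding. The token counts are invented for illustration and are not measurements from our training runs. The function name is hypothetical.

```python
def padding_waste(token_counts, max_tokens):
    """Fraction of positions that are padding when every sample is padded to max_tokens."""
    if any(n > max_tokens for n in token_counts):
        raise ValueError("a sample exceeds max_tokens")
    real = sum(token_counts)
    total = max_tokens * len(token_counts)
    return 1 - real / total

# Hypothetical batch mixing short and long samples.
batch = [100, 200, 400, 800]
print(f"{padding_waste(batch, max_tokens=800):.1%} of positions are padding")
```

A bucketed batch, in which every sample has the same token count, would give a waste of zero by this measure.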

We believe that current video generation models are still in an exploratory phase. Extensive modifications to model code during this period can incur unnecessary development costs. The pad-mask method, while straightforward, is computationally inefficient and can waste resources in video, which involves dense computations. Ultimately, we chose the bucket strategy, which requires no modifications to the model code. Next, we will explain how our bucket strategy supports arbitrary lengths and resolutions. For simplicity, we will use video duration as an example:

We define a megabatch as the total data processed in a single step across all GPUs. A megabatch can be divided into multiple batches, with each batch corresponding to the data processed by a single GPU.

Sort by frame: The first step is to count the number of frames in all video data and sort them. This step aims to group similar data together, with sorting being one method to achieve this.

Group megabatch: Next, all data is divided into groups, each forming a megabatch. Since all data is pre-sorted, most videos within a megabatch have the same number of frames. However, there will always be boundary cases, such as having both 61-frame and 1-frame videos in a single megabatch.

Re-organize megabatch: We re-organize these special megabatches, which actually constitute a small proportion. We randomly replace the minority data in the megabatch with the majority data, thus re-organizing it into a megabatch with a uniform frame count.

Shuffle megabatch: To ensure data randomness, we shuffle both within each megabatch and between different megabatches.

When supporting dynamic resolutions, we simply replace each sample's frame count with its (frame × height × width) shape as the grouping key. This ensures that the data shape processed by each GPU in every step is the same, preventing situations where GPU1 waits for GPU0 to finish processing a longer video. Moreover, the method is entirely decoupled from the model code, serving as a plug-and-play video sampling strategy.
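The four steps above can be sketched in plain Python as follows. This is a minimal illustration and not the sampler used in the repository: the function names are hypothetical, dropping the trailing incomplete megabatch is an assumption, and so is drawing replacement samples from the majority samples of the same megabatch.

```python
import random
from collections import Counter

def build_megabatches(frame_counts, megabatch_size, seed=0):
    """Return lists of sample indices; each list shares a single frame count."""
    rng = random.Random(seed)
    # 1. Sort by frame: order sample indices by their frame count.
    order = sorted(range(len(frame_counts)), key=lambda i: frame_counts[i])
    # 2. Group megabatch: cut the sorted list into fixed-size chunks.
    #    Assumption: a trailing incomplete chunk is dropped.
    last_start = len(order) - megabatch_size + 1
    groups = [order[i:i + megabatch_size] for i in range(0, last_start, megabatch_size)]
    megabatches = []
    for group in groups:
        # 3. Re-organize: replace minority samples with random majority samples.
        majority, _ = Counter(frame_counts[i] for i in group).most_common(1)[0]
        keep = [i for i in group if frame_counts[i] == majority]
        fixed = [i if frame_counts[i] == majority else rng.choice(keep) for i in group]
        # 4a. Shuffle within the megabatch.
        rng.shuffle(fixed)
        megabatches.append(fixed)
    # 4b. Shuffle the order of the megabatches.
    rng.shuffle(megabatches)
    return megabatches

def split_per_gpu(megabatch, num_gpus):
    """Split one megabatch into equally sized per-GPU batches."""
    per_gpu = len(megabatch) // num_gpus
    return [megabatch[g * per_gpu:(g + 1) * per_gpu] for g in range(num_gpus)]

# Hypothetical dataset: 6 images, 9 clips of 29 frames, 5 clips of 61 frames.
frames = [1] * 6 + [29] * 9 + [61] * 5
for mb in build_megabatches(frames, megabatch_size=4):
    print(frames[mb[0]], split_per_gpu(mb, num_gpus=2))
```

To handle dynamic resolution in this sketch, `frame_counts` could hold (frame, height, width) tuples instead of integers; the sorting and majority logic work unchanged on tuples.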

Training stage

Similar to previous work, we use a multi-stage training approach. With the 3D DiT architecture, all parameters can be transferred from images to videos without loss. To explore training costs, all parameters of the diffusion model are trained from scratch. Therefore, we first train a text-to-image model, using the training strategy from PixArt-alpha.

The video model is initialized with weights from a 480p image model. We first train on 480p videos with 29 frames. Next, we adapt the weights to 720p resolution, training on approximately 7 million samples from Panda70M, filtered for aesthetic quality and motion. Finally, we fine-tune the model on 93-frame 720p videos using a higher-quality (HQ) subset of 1 million samples. Below is our training card.

Name	Stage 1	Stage 2	Stage 3	Stage 4	Stage 5
Training Video Size	1×320×240	1×640×480	29×640×480	29×1280×720	93×1280×720
Training Step	146k	200k	30k	21k	3k
Compute (#Num x #Hours)	32 Ascend × 81	32 Ascend × 142	128 Ascend × 38	256 H100 × 64	256 H100 × 84
Checkpoint	-	-	-	HF	HF
Log	-	-	wandb	wandb	wandb
Training Data	10M SAM	5M internal image data	6M HQ Panda70M	6M HQ Panda70M	1M HQ Panda70M and 100k HQ data (collected in v1.1.0)

Additionally, we fine-tuned 3.5k steps from the final 93×720p to get 93×480p for community research use.

Training Image-to-Video Diffusion Model

Coming soon...

Future Work and Discussion

CausalVideoVAE

We observed that high-frequency motion information in videos tends to exhibit jitter, and increasing training duration and data volume does not significantly alleviate this issue. In videos, compressing the duration while maintaining the original latent dimension can lead to significant information loss. A more robust VAE will be released in the next version.

Diffusion Model

We replaced T5 with mT5 to enhance multilingual capabilities, but this capability is limited as our training data is currently only in English. The multilingual ability primarily comes from the mT5 mapping space. We will explore additional text encoders and expand the data in the next steps.

Our model performs well at character consistency, likely because Panda70M is a character-centric dataset. However, it still shows poor performance in text consistency and object generalization. We suspect this is due to the limited amount of data the model has seen, as evidenced by the non-convergence of the loss in the final stage. We hope to collaborate with the open-source community to optimize the 3D DiT architecture.