arXiv:2309.15103

LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Published on Sep 26, 2023
· Submitted by akhaliq on Sep 27, 2023
#2 Paper of the day
Authors:
Xin Ma, Bo Dai, et al.

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.

Community

Here is an ML-generated summary

Objective
The paper proposes LaVie, a cascaded latent diffusion model framework for high-quality text-to-video generation, by leveraging a pre-trained text-to-image model as initialization.

The key contributions are: 1) An efficient temporal module design using temporal self-attention and rotary positional encoding. 2) A joint image-video fine-tuning strategy to mitigate catastrophic forgetting. 3) A new text-video dataset, Vimeo25M, consisting of 25 million high-quality text-video pairs. A shape-level sketch of the three-stage cascade follows.
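
As a rough illustration of that cascade, here is a minimal Python sketch of how the three stages chain together at the level of tensor shapes. The stub functions are hypothetical placeholders for the actual latent diffusion stages; the shapes follow the settings reported in the paper (a 16-frame base clip at 320×512, 61 frames after 4× interpolation, 1280×2048 after super-resolution).

```python
import numpy as np

# Hypothetical stand-ins for the three cascade stages. Each real stage is a
# latent diffusion model; here we only model the shape of the data flow.

def base_t2v(prompt: str) -> np.ndarray:
    # Stage 1: base T2V model generates a short low-resolution clip.
    return np.zeros((16, 320, 512, 3), dtype=np.float32)

def temporal_interpolation(frames: np.ndarray) -> np.ndarray:
    # Stage 2: 4x frame rate, inserting 3 new frames between each
    # adjacent pair: 16 + 15 * 3 = 61 frames.
    t = frames.shape[0]
    return np.zeros((t + (t - 1) * 3, *frames.shape[1:]), dtype=frames.dtype)

def video_super_resolution(frames: np.ndarray) -> np.ndarray:
    # Stage 3: per-frame spatial upscaling to the final resolution.
    return np.zeros((frames.shape[0], 1280, 2048, 3), dtype=frames.dtype)

video = video_super_resolution(temporal_interpolation(base_t2v("a corgi surfing")))
print(video.shape)  # (61, 1280, 2048, 3)
```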

Insights

  • Simple temporal self-attention coupled with rotary positional encoding effectively captures temporal correlations; more complex architectures provide only marginal gains (see the sketch after this list).
  • Joint image-video fine-tuning plays a pivotal role in producing high-quality and creative results. Direct video-only fine-tuning leads to catastrophic forgetting.
  • Joint fine-tuning enables large-scale knowledge transfer from images to videos, including styles, scenes, and characters.
  • A high-quality dataset like Vimeo25M is critical for training high-fidelity T2V models.
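
To make the first insight concrete, here is a minimal PyTorch sketch of a temporal self-attention block with rotary positional encoding (RoPE) applied to queries and keys along the frame axis. This illustrates the general technique rather than the paper's released code; the module name is invented, and applying RoPE over the full channel dimension before the head split is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotary_embedding(x: torch.Tensor) -> torch.Tensor:
    """Apply RoPE along the sequence (time) axis. x: (batch, frames, dim), dim even."""
    b, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class TemporalSelfAttention(nn.Module):
    """Attention over the frame axis only: each spatial location attends across
    time, so the block can be bolted onto a pre-trained 2D diffusion U-Net."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim) -> fold space into the batch
        b, t, s, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = rotary_embedding(q), rotary_embedding(k)  # RoPE on queries/keys
        def split(z):  # (b*s, t, d) -> (b*s, heads, t, d/heads)
            return z.view(b * s, t, self.heads, d // self.heads).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = self.proj(out.transpose(1, 2).reshape(b * s, t, d))
        return out.view(b, s, t, d).permute(0, 2, 1, 3)

x = torch.randn(2, 16, 8 * 8, 64)          # 16 frames, 8x8 latent grid, 64 channels
print(TemporalSelfAttention(64)(x).shape)  # torch.Size([2, 16, 64, 64])
```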

Implementation

  • Base T2V model initialized from pre-trained Stable Diffusion and adapted via pseudo-3D convolutions and spatio-temporal transformers (a pseudo-3D convolution sketch follows this list).
  • Temporal interpolation model trained to increase the frame rate 4x: it takes the 16-frame base video as input and outputs 61 frames (16 + 15 × 3, i.e., three new frames inserted between each adjacent pair).
  • Video super-resolution model fine-tuned to increase spatial resolution to 1280×2048, initialized from a pre-trained image super-resolution model.
  • Joint image-video fine-tuning used during training to enable knowledge transfer from images to videos (sketched below).
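
Below is a minimal PyTorch sketch of the pseudo-3D convolution idea from the first bullet: a 2D spatial convolution (which can inherit pre-trained T2I weights) followed by a 1D convolution over the frame axis. Initializing the temporal convolution to identity, so the inflated network initially reproduces the image model, is a common trick and an assumption here, not necessarily LaVie's exact recipe.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized (2+1)D convolution: 2D spatial conv, then 1D temporal conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Identity init: the inflated model starts out behaving exactly like
        # the pre-trained T2I model (an assumed, common initialization).
        self.temporal = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)         # fold time into batch
        x = self.spatial(x)                                   # per-frame 2D conv
        c2 = x.shape[1]
        x = x.reshape(b, t, c2, h, w).permute(0, 3, 4, 2, 1)  # (b, h, w, c2, t)
        x = self.temporal(x.reshape(b * h * w, c2, t))        # mix across frames
        return x.reshape(b, h, w, c2, t).permute(0, 3, 4, 1, 2)

x = torch.randn(1, 4, 16, 40, 64)   # 16-frame latent video
print(Pseudo3DConv(4, 4)(x).shape)  # torch.Size([1, 4, 16, 40, 64])
```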
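
For the joint image-video fine-tuning bullet, one common way to implement the idea is to treat images as single-frame clips and alternate image and video batches, so the model keeps seeing T2I-style data while learning motion. This is a hedged sketch of such a loop; the batching scheme and the `denoising_loss` method are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def training_step(model, optimizer, batch, is_video: bool):
    x = batch["latents"]            # video: (b, c, t, h, w); image: (b, c, h, w)
    if not is_video:
        x = x.unsqueeze(2)          # image -> 1-frame video: (b, c, 1, h, w)
    noise = torch.randn_like(x)
    timesteps = torch.randint(0, 1000, (x.shape[0],), device=x.device)
    # `denoising_loss` is a hypothetical method standing in for the usual
    # noise-prediction objective of a latent diffusion model.
    loss = model.denoising_loss(x, noise, timesteps, batch["text_emb"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Alternate video and image batches so the T2I prior is not forgotten:
# for video_batch, image_batch in zip(video_loader, image_loader):
#     training_step(model, optimizer, video_batch, is_video=True)
#     training_step(model, optimizer, image_batch, is_video=False)
```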

Results
Both quantitative and qualitative evaluations demonstrate that LaVie achieves state-of-the-art performance in zero-shot text-to-video generation.

Models citing this paper 2

Datasets citing this paper 0

Spaces citing this paper 5

Collections including this paper 7