Mind the Time: Temporally-Controlled Multi-Event Video Generation
Abstract
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
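The abstract does not spell out ReRoPE's exact formulation, so the sketch below only illustrates the general idea in PyTorch: video-token timestamps are rescaled by an event's start/end time before applying standard rotary embeddings, so cross-attention between that event's caption and the video tokens is phase-aligned mainly within the event's time window. All function names, shapes, and the choice to anchor caption tokens at the window midpoint are assumptions made for illustration, not the paper's implementation.

```python
# Hedged sketch of time-based rotary positional encoding in cross-attention.
# Not the paper's ReRoPE; an approximation of the idea described in the abstract.
import torch
import torch.nn.functional as F


def rotary_angles(positions, dim, base=10000.0):
    # Standard RoPE frequencies; `positions` is a 1-D tensor of scalar positions.
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None] * freqs[None, :]            # [N, dim // 2]


def apply_rope(x, angles):
    # Rotate consecutive feature pairs of x ([N, dim]) by angles ([N, dim // 2]).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def event_cross_attention(video_q, caption_k, caption_v, frame_times,
                          event_start, event_end):
    """Cross-attention between video-token queries and one event caption.

    Frame timestamps are rescaled so the event's [start, end] window maps to
    [0, 1]; caption tokens are anchored at the window midpoint (0.5), so the
    rotary phase difference is small for frames inside the window and grows
    for frames outside it, softly biasing the caption toward its time span.
    (Midpoint anchoring is an illustrative choice, not taken from the paper.)
    """
    dim = video_q.shape[-1]
    q_pos = (frame_times - event_start) / max(event_end - event_start, 1e-6)
    k_pos = torch.full((caption_k.shape[0],), 0.5)
    q = apply_rope(video_q, rotary_angles(q_pos, dim))
    k = apply_rope(caption_k, rotary_angles(k_pos, dim))
    attn = F.softmax(q @ k.T / dim ** 0.5, dim=-1)        # [T_video, L_caption]
    return attn @ caption_v                               # [T_video, dim]


# Toy usage: 16 frames spanning 0-5 s, one event active from 1 s to 3 s.
T, L, dim = 16, 8, 64
out = event_cross_attention(
    video_q=torch.randn(T, dim),
    caption_k=torch.randn(L, dim),
    caption_v=torch.randn(L, dim),
    frame_times=torch.linspace(0.0, 5.0, T),
    event_start=1.0,
    event_end=3.0,
)
print(out.shape)  # torch.Size([16, 64])
```

In a multi-event setting, one such time-conditioned cross-attention pass per event (each with its own start/end window) would let the model attend to one event caption at a time, which is the binding idea the abstract describes.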
Community
This is an automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation (2024)
- Motion Control for Enhanced Complex Action Video Generation (2024)
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation (2024)
- Tell What You Hear From What You See -- Video to Audio Generation Through Text (2024)
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation (2024)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction (2024)
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions (2024)