VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Abstract
The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. The datasets currently used to train these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which relies on image models for tagging and manual rule-based filtering, incurs a high computational cost and still leaves unclean data behind. As a result, appropriate training datasets for text-to-video models are lacking. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset provides high-quality videos and detailed captions with excellent temporal consistency. When used to train a video generation model, this dataset yields experimental results that surpass those of models trained on other datasets.
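The abstract mentions a coarse-to-fine curation strategy but does not spell out its mechanics. A minimal sketch of how such a two-stage filter could work, assuming hypothetical per-clip quality scores (`aesthetic`, `temporal_consistency`) that the paper's actual pipeline would compute with dedicated models:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A candidate video clip with hypothetical precomputed quality scores in [0, 1]."""
    path: str
    aesthetic: float             # assumed image-quality score
    temporal_consistency: float  # assumed frame-to-frame similarity score
    caption: str

def coarse_filter(clips, min_aesthetic=0.5, min_consistency=0.6):
    """Coarse stage: cheap threshold checks discard clearly bad clips."""
    return [
        c for c in clips
        if c.aesthetic >= min_aesthetic and c.temporal_consistency >= min_consistency
    ]

def fine_select(clips, top_k):
    """Fine stage: rank the survivors by a combined score and keep the best."""
    ranked = sorted(
        clips,
        key=lambda c: 0.5 * c.aesthetic + 0.5 * c.temporal_consistency,
        reverse=True,
    )
    return ranked[:top_k]
```

The coarse stage keeps the expensive fine-grained ranking (in the paper, presumably model-based scoring and captioning) off the bulk of the raw data; thresholds, weights, and score names here are illustrative assumptions, not the paper's actual values.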
Community
The video shows a highway winding through a lush green landscape. The road is surrounded by dense trees and vegetation on both sides. The sky is overcast, and the mountains in the distance are partially obscured by clouds. The highway appears to be in good condition, with clear lane markings. Several vehicles are traveling on the road, including cars and trucks. The colors in the video are predominantly green from the trees and grey from the road and sky.
The following papers were recommended by the Semantic Scholar API
- OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation (2024)
- MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (2024)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions (2024)
- SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2024)
- VIMI: Grounding Video Generation through Multi-modal Instruction (2024)