Dataset Preparation

Stage 2: Video-Language Alignment

Pretraining

The publicly available portion of our pretraining data consists of the following datasets:

CC3M images
CC12M images
SBU images
VG images
COCO images
WebVid videos
InternVid videos

Evaluation

For evaluation, we follow VINDLU to prepare the datasets, but we DO NOT compress the videos and images. Instead, we load the original data together with the same JSON files provided by VINDLU.
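As a rough illustration of pairing the original media with the JSON files, the sketch below resolves each annotation entry to a file on disk and reports entries whose media is missing. It assumes each JSON file is a list of objects with a relative media path under a `video` (or `image`) key and a `caption` field. That schema, the function name `load_annotations`, and the example paths are hypothetical, not the loader used by VINDLU or this repository.

```python
import json
from pathlib import Path


def load_annotations(json_path, media_root, media_key="video"):
    """Read an annotation JSON and resolve each entry under media_root.

    Assumed (hypothetical) schema: a list of dicts, each holding a relative
    media path under `media_key` and an optional "caption".
    Returns (resolved, missing) lists of {"path", "caption"} records.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)

    root = Path(media_root)
    resolved, missing = [], []
    for entry in entries:
        media_path = root / entry[media_key]
        record = {"path": media_path, "caption": entry.get("caption")}
        # Separate entries whose original (uncompressed) file is not on disk.
        if media_path.is_file():
            resolved.append(record)
        else:
            missing.append(record)
    return resolved, missing
```

A call such as `load_annotations("anno/msrvtt_test.json", "data/msrvtt/videos")` (both paths are placeholders) would return the usable entries plus a list of missing files, which is a cheap sanity check before running retrieval evaluation.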

Video-Text Retrieval

MSRVTT videos
MSVD videos
ActivityNet videos
DiDeMo videos

Stage 3: VideoChat

Pretraining

VideoChat-IT

Evaluation

MVBench

Please refer to MVBench for data preparation.