Dataset Preparation

Stage 2: Video-Language Alignment

Pretraining

The public portion of the pretraining data we use is as follows:

Evaluation

For evaluation, we follow VINDLU to prepare the datasets, but we DO NOT compress the videos and images. We load the original, uncompressed data together with the same JSON files provided by VINDLU.
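As a rough illustration of pairing the JSON files with the original media, here is a minimal sketch. It assumes each JSON file holds a list of records and that each record stores a relative media path under a key such as `video`. The function name `load_annotations`, the `media_key` parameter, and the record layout are hypothetical and are not taken from VINDLU or from this repository, so adjust them to the actual files.

```python
import json
from pathlib import Path


def load_annotations(json_path, media_root, media_key="video"):
    """Load a JSON annotation list and resolve each entry against the original media.

    Assumption: the file is a JSON list of dicts, and record[media_key] is a
    path relative to media_root. Both the key name and the layout are hypothetical.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    media_root = Path(media_root)
    resolved, missing = [], []
    for record in records:
        path = media_root / record[media_key]
        if path.is_file():
            # Keep the record, but point it at the resolved file path.
            resolved.append({**record, media_key: str(path)})
        else:
            # Collect missing files so they can be reported before evaluation.
            missing.append(str(path))
    return resolved, missing
```

Checking `missing` before running evaluation helps catch path mismatches early, which are easy to introduce when the original files are used in place of compressed copies.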

Video-Text Retrieval

Stage 3: VideoChat

Pretraining

Evaluation

MVBench

Please refer to MVBench.