# Dataset Preparation

# Stage 2: Video-Language Alignment

## Pretraining

The public portion of the pretraining dataset we use is as follows:

- [CC3M images](https://github.com/google-research-datasets/conceptual-captions)
- [CC12M images](https://github.com/google-research-datasets/conceptual-12m)
- [SBU images](https://www.cs.rice.edu/~vo9/sbucaptions/)
- [VG images](https://visualgenome.org/api/v0/api_home.html)
- [COCO images](https://cocodataset.org/#download)
- [WebVid videos](https://github.com/m-bain/webvid)
- [InternVid videos](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid)

## Evaluation

For evaluation, we follow [VindLU](https://github.com/klauscc/VindLU/) to prepare the datasets, but we **DO NOT** compress the videos or images: we use the original data together with the same **JSON** annotation files provided by [VindLU](https://drive.google.com/drive/folders/12bC7WotvwyTG4pVvYeU4iZzmBLP1-6d9).

### Video-Text Retrieval

- [MSRVTT videos](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip)
- [MSVD videos](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)
- [ActivityNet videos](http://activity-net.org/download.html)
- [DiDeMo videos](https://github.com/LisaAnne/LocalizingMoments)

# Stage 3: VideoChat

## Pretraining

- [VideoChat-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT)

## Evaluation

### MVBench

Please refer to [MVBench](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2).
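
After downloading the Stage 2 evaluation data above, it can be useful to verify that the VindLU JSON annotations resolve against your original, uncompressed videos before training or evaluation. Below is a minimal sketch of such a check; the file name `msrvtt_ret_test1k.json`, the video root, and the `caption` / `video` / `image` keys are assumptions, so inspect the JSON files you downloaded for the exact names and paths.

```python
import json
from pathlib import Path

# Hypothetical locations -- point these at your downloaded VindLU JSON files
# and the original (uncompressed) videos.
anno_file = Path("anno_downstream/msrvtt_ret_test1k.json")
video_root = Path("videos/MSRVTT/videos/all")

# The annotation file is assumed to be a list of dicts; the video path key
# may be "video" or "image" depending on the dataset.
with anno_file.open() as f:
    annos = json.load(f)

missing = []
for item in annos:
    rel_path = item.get("video") or item.get("image")
    if rel_path is not None and not (video_root / rel_path).exists():
        missing.append(rel_path)

print(f"{len(annos)} annotations, {len(missing)} referenced videos missing")
```

If any referenced videos are reported missing, double-check that you downloaded the original videos (not a recompressed copy with renamed files) and that `video_root` matches the directory layout the JSON paths assume.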