# Dataset Preparation
# Stage 2: Video-language Alignment
## Pretraining
The public portion of the pre-training data we use is as follows:
- [CC3M images](https://github.com/google-research-datasets/conceptual-captions)
- [CC12M images](https://github.com/google-research-datasets/conceptual-12m)
- [SBU images](https://www.cs.rice.edu/~vo9/sbucaptions/)
- [VG images](https://visualgenome.org/api/v0/api_home.html)
- [COCO images](https://cocodataset.org/#download)
- [WebVid videos](https://github.com/m-bain/webvid)
- [InternVid videos](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid)
## Evaluation
For evaluation, we follow [VINDLU](https://github.com/klauscc/VindLU/) to prepare the datasets, but we **DO NOT** compress the videos and images: we use the original media together with the same **JSON** annotation files provided by [VINDLU](https://drive.google.com/drive/folders/12bC7WotvwyTG4pVvYeU4iZzmBLP1-6d9).
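Because the original, uncompressed media are used with the released JSON annotation files, it can help to verify that every path referenced in an annotation file actually resolves on disk before training or evaluation. Below is a minimal Python sketch of such a check; the field names (`video`, `image`) and the example paths are assumptions for illustration, not the exact VindLU schema.

```python
# Minimal sketch (not part of VindLU): report annotation entries whose media file is missing.
# The "video"/"image" field names and example paths below are assumptions.
import json
import os

def check_annotations(anno_path: str, media_root: str) -> None:
    """Load an annotation JSON and count entries whose referenced media file is missing."""
    with open(anno_path, "r") as f:
        annotations = json.load(f)

    missing = []
    for entry in annotations:
        rel_path = entry.get("video") or entry.get("image")
        if rel_path is None:
            continue
        if not os.path.exists(os.path.join(media_root, rel_path)):
            missing.append(rel_path)

    print(f"{len(annotations)} entries, {len(missing)} missing media files")

# Example usage (hypothetical paths):
# check_annotations("anno_downstream/msrvtt_ret_test.json", "videos/MSRVTT")
```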
### Video-Text Retrieval
- [MSRVTT videos](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip)
- [MSVD videos](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)
- [ActivityNet videos](http://activity-net.org/download.html)
- [DiDeMo videos](https://github.com/LisaAnne/LocalizingMoments)
# Stage 3: VideoChat
## Pretraining
- [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT)
## Evaluation
### MVBench
Please refer to [MVBench](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) for data preparation and evaluation instructions.