Spaces:
Running
on
Zero
Running
on
Zero
File size: 5,937 Bytes
2d9a728 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
# Model Zoo
## Pretraining
For $\text{InternVideo2}_{s2}$, we load those models of $\text{InternVideo2}_{s1}$ and further pretrain them on multi-modality datasets.
For $\text{InternVideo2}_{clip}$, we load those models of $\text{InternVideo2}_{s2}$.
| Model | Setting | Model | Pretraining Script |
| -------- | ----------- | ------ | ------------- |
| $\text{InternVideo2}_{s2}$-1B | IV-25.5M | [:hugs: HF link](https://huggingface.co/OpenGVLab/InternVideo2/blob/main/InternVideo2-stage2_1b-224p-f4.pt) | [script](scripts/pretraining/stage2/1B/run.sh) |
| $\text{InternVideo2}_{clip}$-1B | IV-25.5M | TBD | [script](scripts/pretraining/clip/1B/run.sh) |
| $\text{InternVideo2}_{s2}$-6B | IV-400M | TBD | [script](scripts/pretraining/stage2/6B/run.sh) |
| $\text{InternVideo2}_{clip}$-6B | IV-400M | TBD | [script](scripts/pretraining/clip/6B/run.sh) |
### Zero-shot Evaluation
## Zero-Shot Video-Text Retrieval
| Model | Dataset | T2V | V2T | Evaluation Script |
| -------- | ----------- | ------ | ------- | ------- |
| $\text{InternVideo2}_{s2}$-1B | MSRVTT | 51.9 | 50.9 | [script](scripts/evaluation/stage2/zero_shot/1B/eval_msrvtt.sh) |
| | LSMDC | 32.0 | 27.3 | [script](scripts/evaluation/stage2/zero_shot/1B/eval_lsmdc.sh) |
| | DiDeMo | 57.0 | 54.3 | [script](scripts/evaluation/stage2/zero_shot/1B/eval_didemo.sh) |
| | MSVD | 58.1 | 83.3 | [script](scripts/evaluation/stage2/zero_shot/1B/eval_msvd.sh) |
| | ANet | 60.4 | 54.8 | [script](scripts/evaluation/stage2/zero_shot/1B/eval_anet.sh) |
| | VATEX | 70.4 | 85.4 | [script](scripts/evaluation/stage2/zero_shot/1B/eval_vatex.sh) |
| $\text{InternVideo2}_{s2}$-6B | MSRVTT | 55.9 | 53.7 | TBD |
| | LSMDC | 33.8 | 30.1 | TBD |
| | DiDeMo | 57.9 | 57.1 | TBD |
| | MSVD | 59.3 | 83.1 | TBD |
| | ANet | 63.2 | 56.5 | TBD |
| | VATEX | 71.5 | 85.3 | TBD |
| Model | Dataset | T2V | V2T | Evaluation Script |
| -------- | ----------- | ------ | ------- | ------- |
| $\text{InternVideo2}_{clip}$-1B | MSRVTT | 50.0 | 48.4 | [script](scripts/evaluation/clip/zero_shot/1B/eval_msrvtt.sh) |
| | LSMDC | 26.4 | 23.1 | [script](scripts/evaluation/clip/zero_shot/1B/eval_lsmdc.sh) |
| | DiDeMo | 47.8 | 46.4 | [script](scripts/evaluation/clip/zero_shot/1B/eval_didemo.sh) |
| | ANet | 49.4 | 46.2 | [script](scripts/evaluation/clip/zero_shot/1B/eval_anet.sh) |
| | VATEX_en | 63.5 | 81.2 | [script](scripts/evaluation/clip/zero_shot/1B/eval_vatex_en.sh) |
| | VATEX_ch | 54.9 | 76.4 | [script](scripts/evaluation/clip/zero_shot/1B/eval_vatex_ch.sh) |
| $\text{InternVideo2}_{clip}$-6B | MSRVTT | 50.9 | 50.6 | [script](scripts/evaluation/clip/zero_shot/6B/eval_msrvtt.sh) |
| | LSMDC | 29.4 | 26.3 | [script](scripts/evaluation/clip/zero_shot/6B/eval_lsmdc.sh) |
| | DiDeMo | 50.5 | 46.8| [script](scripts/evaluation/clip/zero_shot/6B/eval_didemo.sh) |
| | ANet | 50.2 | 47.5 | [script](scripts/evaluation/clip/zero_shot/6B/eval_anet.sh) |
| | VATEX_en | 64.1 | 82.6 | [script](scripts/evaluation/clip/zero_shot/6B/eval_vatex_en.sh) |
| | VATEX_ch | 54.6 | 76.9 | [script](scripts/evaluation/clip/zero_shot/6B/eval_vatex_ch.sh) |
## Zero-Shot Action Recognition
| Model | Dataset | top-1 | AVG | Script |
| -------- | ----------- | ------ | ------- | ------- |
| $\text{InternVideo2}_{clip}$-1B | K400 | 73.1 | 82.4 | [script](scripts/evaluation/clip/zero_shot/1B/eval_k400.sh) |
| | K600 | 72.8 | 81.8 | [script](scripts/evaluation/clip/zero_shot/1B/eval_k600.sh) |
| | K700 | 64.9 | 75.2 | [script](scripts/evaluation/clip/zero_shot/1B/eval_k700.sh) |
| | UCF101 | 88.8 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_ucf101.sh) |
| | HMDB51 | 53.9 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_hmdb51.sh) |
| | MiT | 31.6 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_mit.sh) |
| | SSv2-MC | 61.5 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_ssv2_mc.sh) |
| $\text{InternVideo2}_{clip}$-6B | K400 | 72.7 | 82.2 | [script](scripts/evaluation/clip/zero_shot/1B/eval_k400.sh) |
| | K600 | 71.7 | 81.2 | [script](scripts/evaluation/clip/zero_shot/1B/eval_k600.sh) |
| | K700 | 64.2 | 75.2 | [script](scripts/evaluation/clip/zero_shot/1B/eval_k700.sh) |
| | UCF101 | 89.5 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_ucf101.sh) |
| | HMDB51 | 56.7 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_hmdb51.sh) |
| | MiT | 32.9 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_mit.sh) |
| | SSv2-MC | 63.5 | - | [script](scripts/evaluation/clip/zero_shot/1B/eval_ssv2_mc.sh) |
| Model | Dataset | mAP | Script |
| -------- | ----------- | ------ | ------- |
| $\text{InternVideo2}_{clip}$-1B | Charades | 32.9 | [script](scripts/evaluation/clip/zero_shot/1B/eval_charades_mc.sh) |
| $\text{InternVideo2}_{clip}$-6B | Charades | 34.6 | [script](scripts/evaluation/clip/zero_shot/6B/eval_charades_mc.sh) | |