Model Zoo
Pretraining
For $\text{InternVideo2}_{s2}$, we initialize from the $\text{InternVideo2}_{s1}$ checkpoints and further pretrain them on multi-modality datasets.
For $\text{InternVideo2}_{clip}$, we initialize from the $\text{InternVideo2}_{s2}$ checkpoints.
| Model | Setting | Checkpoint | Pretraining Script |
| --- | --- | --- | --- |
| $\text{InternVideo2}_{s2}$-1B | IV-25.5M | :hugs: HF link | script |
| $\text{InternVideo2}_{clip}$-1B | IV-25.5M | TBD | script |
| $\text{InternVideo2}_{s2}$-6B | IV-400M | TBD | script |
| $\text{InternVideo2}_{clip}$-6B | IV-400M | TBD | script |
Zero-shot Evaluation
Zero-Shot Video-Text Retrieval
| Model | Dataset | T2V | V2T | Evaluation Script |
| --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{s2}$-1B | MSRVTT | 51.9 | 50.9 | script |
| | LSMDC | 32.0 | 27.3 | script |
| | DiDeMo | 57.0 | 54.3 | script |
| | MSVD | 58.1 | 83.3 | script |
| | ANet | 60.4 | 54.8 | script |
| | VATEX | 70.4 | 85.4 | script |
| $\text{InternVideo2}_{s2}$-6B | MSRVTT | 55.9 | 53.7 | TBD |
| | LSMDC | 33.8 | 30.1 | TBD |
| | DiDeMo | 57.9 | 57.1 | TBD |
| | MSVD | 59.3 | 83.1 | TBD |
| | ANet | 63.2 | 56.5 | TBD |
| | VATEX | 71.5 | 85.3 | TBD |

| Model | Dataset | T2V | V2T | Evaluation Script |
| --- | --- | --- | --- | --- |
| $\text{InternVideo2}_{clip}$-1B | MSRVTT | 50.0 | 48.4 | script |
| | LSMDC | 26.4 | 23.1 | script |
| | DiDeMo | 47.8 | 46.4 | script |
| | ANet | 49.4 | 46.2 | script |
| | VATEX_en | 63.5 | 81.2 | script |
| | VATEX_ch | 54.9 | 76.4 | script |
| $\text{InternVideo2}_{clip}$-6B | MSRVTT | 50.9 | 50.6 | script |
| | LSMDC | 29.4 | 26.3 | script |
| | DiDeMo | 50.5 | 46.8 | script |
| | ANet | 50.2 | 47.5 | script |
| | VATEX_en | 64.1 | 82.6 | script |
| | VATEX_ch | 54.6 | 76.9 | script |
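
For readers unfamiliar with how T2V (text-to-video) and V2T (video-to-text) retrieval scores are typically computed, the following is a minimal sketch, not the repository's evaluation script. It assumes the reported numbers are Recall@K-style percentages (this page does not state the metric) and that each caption has exactly one matching video at the same index, which does not hold for datasets with several captions per video. The names `recall_at_k`, `transpose`, and the toy similarity matrix are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of queries whose true match is ranked within the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap the query and candidate roles (T2V matrix -> V2T matrix)."""
    return [list(col) for col in zip(*sim)]


# Toy similarity matrix (made-up values): rows are captions, columns are videos.
text_to_video = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]

t2v_r1 = recall_at_k(text_to_video, k=1)
v2t_r1 = recall_at_k(transpose(text_to_video), k=1)
print(f"T2V R@1: {t2v_r1:.1f}  V2T R@1: {v2t_r1:.1f}")
```

In this toy case the third caption scores higher against the first video than against its own, so T2V and V2T differ even though both come from the same matrix.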
Zero-Shot Action Recognition
| Model | Dataset | mAP | Script |
| --- | --- | --- | --- |
| $\text{InternVideo2}_{clip}$-1B | Charades | 32.9 | script |
| $\text{InternVideo2}_{clip}$-6B | Charades | 34.6 | script |
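
Charades is a multi-label dataset, which is why the table reports mAP rather than top-1 accuracy. The sketch below shows one common way to compute mean average precision from per-class scores. It is an illustration under stated assumptions, not the evaluation code behind the numbers above: the function names, the toy scores, and the choice to skip classes with no positive label are all hypothetical.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one class: mean of the precision measured at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],
    label_matrix: Sequence[Sequence[int]],
) -> float:
    """Mean AP (in percent) over classes; matrices are indexed [video][class].

    Assumption: classes without any positive label are skipped.
    """
    aps: List[float] = []
    for c in range(len(score_matrix[0])):
        labels = [row[c] for row in label_matrix]
        if not any(labels):
            continue
        scores = [row[c] for row in score_matrix]
        aps.append(average_precision(scores, labels))
    return 100.0 * sum(aps) / len(aps) if aps else 0.0


# Toy example (made-up values): 3 videos, 2 action classes.
scores = [[0.9, 0.2], [0.4, 0.8], [0.6, 0.5]]
labels = [[1, 1], [0, 0], [1, 1]]

toy_map = mean_average_precision(scores, labels)
print(f"mAP: {toy_map:.1f}")
```

In a zero-shot setting the per-class scores would come from the similarity between each video embedding and a text embedding of the class name, so no classifier head is trained.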