# LAVILA Model Zoo
## Multi-node Training
We use multi-node training on a SLURM cluster with [submitit](https://github.com/facebookincubator/submitit) to produce the results and models in the paper.
Please install `submitit` in your conda environment:
```bash
pip install submitit
```
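The launcher scripts in this repo (`run_with_submitit_*.py`) submit jobs through submitit, so no hand-written SLURM scripts are needed. A quick sanity check that the install works, plus a look at a launcher's SLURM-related flags:
```bash
# confirm submitit is importable, then inspect the launcher's options (e.g. --nodes)
python -c "import submitit; print(submitit.__version__)"
python run_with_submitit_finetune_retrieval.py --help
```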
## Pre-training
Please refer to [PRETRAIN.md](./PRETRAIN.md).
## Narrator
| Visual Encoder | Text Decoder | METEOR | ROUGE-L | CIDEr | Pre-trained<br>Vis. Encoder (md5) | checkpoint (md5) |
| :------------: | :----------: | :----: | :-----: | :---: | :-------------------------------: | :--------: |
| TSF-B | GPT-2 | 0.282 | 0.517 | 0.833 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth) (dbcc4d) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/narrator/vclm_openai_timesformer_base_gpt2_base.pt_ego4d.jobid_319630.ep_0002.md5sum_68a71f.pth) (68a71f) |
| TSF-L@HR | GPT-2 XL | 0.298 | 0.539 | 0.977 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth) (5c69b8) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/narrator/vclm_openai_timesformer_large_336px_gpt2_xl.pt_ego4d.jobid_246897.ep_0003.md5sum_443263.pth) (443263) |
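Each checkpoint link is accompanied by the first six characters of its md5 sum, which you can use to verify the download; for example:
```bash
# download the TSF-B narrator checkpoint and check its md5 prefix (68a71f per the table)
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/narrator/vclm_openai_timesformer_base_gpt2_base.pt_ego4d.jobid_319630.ep_0002.md5sum_68a71f.pth
md5sum vclm_openai_timesformer_base_gpt2_base.pt_ego4d.jobid_319630.ep_0002.md5sum_68a71f.pth | cut -c1-6   # expect: 68a71f
```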
<details><summary>Ego4D val split</summary>
<p>
```bash
# evaluate the narrator on the Ego4D val split (reports METEOR / ROUGE-L / CIDEr)
torchrun --nproc_per_node=1 \
eval_narrator.py \
--caption-top-p 0.95 --caption-temperature 0.7 \
--eval-freq 10000 \
--resume $CHECKPOINT
```
</p></details>
## Zero-shot
<div class="table-wrapper" markdown="block">
| | Backbone | EK-100 MIR<br>avg. mAP | EK-100 MIR<br>avg. nDCG | Charades-Ego<br>mAP^ | EGTEA<br>mean acc. | EgoMCQ<br>intra-video acc. | checkpoint |
| :----------: | :------: | :--------------------: | :---------------------: | :------------------: | :-----------------: | :------------------------: | :----------: |
| Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 | [Epoch 1](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/egovlp_epo1_converted_f16.md5sum_7a3d3b.pth), [best epoch](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/egovlp_converted_f16.md5sum_c33363.pth) |
| LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 | [Epoch 1](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0001.md5sum_02dbb9.pth)^, [Epoch 5](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0005.md5sum_d73a9c.pth) |
| LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 | [Epoch 1](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large.narrator_rephraser.ep_0001.md5sum_9a25de.pth)^, [Epoch 3](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth) |
</div>
^ Note that the pre-trained checkpoint used to evaluate CharadesEgo is different from the one used to evaluate the other datasets.
Specifically, we use the epoch-1 checkpoint for zero-shot evaluation on CharadesEgo and the checkpoint achieving the best average mAP on EK-100 MIR for the other datasets, as is done in [EgoVLP](https://arxiv.org/pdf/2206.01670.pdf).
Our guess is that since CharadesEgo videos (captured by head-mounted mobile cameras) are visually different from Ego4D/EPIC-Kitchens videos (captured by professional action cameras, e.g. GoPro), pre-training on Ego4D videos for longer introduces some domain discrepancy.
^^ We use the checkpoints released by [EgoVLP](https://github.com/showlab/EgoVLP) and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported ones, especially on EK-100 MIR, since we evaluate on raw videos directly (for details, see Appendix F & Table 10 in our paper).
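In the commands below, `$PATH` is a placeholder for the path to the pre-trained checkpoint. If you set it as an actual shell variable, use a different name (`PATH` is reserved in POSIX shells), e.g.:
```bash
# hypothetical variable name; substitute $CKPT wherever a command below says $PATH
CKPT=checkpoints/clip_openai_timesformer_base.narrator_rephraser.ep_0005.md5sum_d73a9c.pth
```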
<details><summary>1. EK-100 MIR</summary>
<p>
```bash
python eval_zeroshot.py --dataset ek100_mir --root datasets/EK100/video_ht256px/ --clip-length 4 --resume $PATH
```
Increasing the number of frames per clip, e.g. `--clip-length 16`, should yield better performance.
</p></details>
<details><summary>2. EK-100 CLS</summary>
<p>
```bash
python eval_zeroshot.py --dataset ek100_cls --metadata-val datasets/EK100/epic-kitchens-100-annotations/EPIC_100_validation.csv --resume $PATH
```
</p></details>
<details><summary>3. Charades-Ego</summary>
<p>
```bash
python eval_zeroshot.py --dataset charades_ego --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv --root datasets/CharadesEgo/CharadesEgo_v1_480/ --clip-length 16 --sparse-sample --resume $PATH
```
</p></details>
<details><summary>4. EGTEA</summary>
<p>
```bash
python eval_zeroshot.py --dataset egtea --metadata-val datasets/EGTEA/test_split1.txt --root datasets/EGTEA/cropped_clips/ --clip-length 16 --clip-stride 2 --num-crops 3 --num-clips 10 --resume $PATH
```
</p></details>
<details><summary>5. EgoMCQ</summary>
<p>
```bash
python eval_zeroshot.py --dataset ego4d_mcq --metadata-val datasets/Ego4D/egomcq.json --root datasets/Ego4D/video_5min_chunks_288px/ --clip-length 4 --resume $PATH --use-half -j 4
```
</p></details>
## Fine-tuned
### EK-100 MIR
<div class="table-wrapper" markdown="block">
| | Backbone | avg mAP | avg nDCG | Pretrain (md5) | Fine-tuned checkpoint | training log |
| :----: | :-------:| :-----: | :------: | :----------: | :-------------------: | :----------: |
| LAVILA | TSF-B | 50.5 | 65.0 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0005.md5sum_d73a9c.pth) (d73a9c) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_mir/clip_openai_timesformer_base.ft_ek100_mir.ep_0085.md5sum_c67d95.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_mir/clip_openai_timesformer_base.ft_ek100_mir.jobid_57361.log) |
| LAVILA | TSF-L | 50.9 | 66.5 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth) (c89337) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_mir/clip_openai_timesformer_large.ft_ek100_mir.ep_0095.md5sum_bd508b.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_mir/clip_openai_timesformer_large.ft_ek100_mir.jobid_56606.log) |
</div>
<details><summary>Training and evaluating scripts</summary>
<p>
### Multi-node training (Slurm)
```bash
# TimeSformer-Base
python run_with_submitit_finetune_retrieval.py \
--pretrain-model $PATH \
--use-checkpoint --nodes 4
# TimeSformer-Large
python run_with_submitit_finetune_retrieval.py \
--pretrain-model $PATH \
--batch-size 4 \
--use-checkpoint --nodes 4
```
### Single-machine training
```bash
torchrun --nproc_per_node=8 \
main_finetune_retrieval.py \
--output-dir $OUT_DIR \
--pretrain-model $PATH \
--use-checkpoint
```
Note that you might see a slight drop in performance when training on a single node compared to multiple nodes (everything else being equal) because of the smaller total batch size.
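For reference, the effective batch size is `nodes × GPUs-per-node × per-GPU batch size`; a quick illustration with hypothetical values:
```bash
# hypothetical values: 1 node x 8 GPUs x batch size 4 per GPU
NODES=1; GPUS_PER_NODE=8; PER_GPU_BATCH=4
echo $(( NODES * GPUS_PER_NODE * PER_GPU_BATCH ))   # 32, vs. 128 with --nodes 4
```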
### Evaluation
Evaluation is done every `--eval-freq 5` epochs by default during fine-tuning.
To evaluate a checkpoint after fine-tuning, switch to `--evaluate` mode and specify the checkpoint path with `--resume $FINETUNED_CHECKPOINT`:
```bash
torchrun --nproc_per_node=1 \
main_finetune_retrieval.py \
--output-dir $OUT_DIR \
--pretrain-model $PATH \
--use-checkpoint \
--evaluate \
--resume $FINETUNED_CHECKPOINT
```
</p></details>
### CharadesEgo
<div class="table-wrapper" markdown="block">
| | Backbone | video mAP | Pretrain^ (md5) | Fine-tuned checkpoint | training log |
| :----: | :-------:| :------: | :-------: | :-------------------: | :----------: |
| LAVILA | TSF-B | 33.7 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0001.md5sum_02dbb9.pth) (02dbb9) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/charades_ego/clip_openai_timesformer_base.ft_charades_ego.ep_0005.md5sum_39bf4b.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/charades_ego/clip_openai_timesformer_base.ft_charades_ego.jobid_65760.log) |
| LAVILA | TSF-L | 36.1 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large.narrator_rephraser.ep_0001.md5sum_9a25de.pth) (9a25de) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/charades_ego/clip_openai_timesformer_large.ft_charades_ego.ep_0003.md5sum_9448b2.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/charades_ego/clip_openai_timesformer_large.ft_charades_ego.jobid_65760.log) |
</div>
^ Note that the pre-trained checkpoint for fine-tuning on CharadesEgo differs from the one used for EK-100 or EGTEA, for the same reason stated above.
<details><summary>Training and evaluating scripts</summary>
<p>
### Multi-node training (Slurm)
```bash
# TimeSformer-Base
python run_with_submitit_finetune_retrieval.py \
--dataset charades_ego \
--metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
--metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
--root datasets/CharadesEgo/CharadesEgo_v1_480/ \
--epochs 10 \
--save-freq 1 --eval-freq 1 \
--sparse-sample \
--pretrain-model $PATH \
--use-checkpoint --nodes 4
# TimeSformer-Large
python run_with_submitit_finetune_retrieval.py \
--dataset charades_ego \
--metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
--metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
--root datasets/CharadesEgo/CharadesEgo_v1_480/ \
--epochs 10 \
--save-freq 1 --eval-freq 1 \
--sparse-sample \
--pretrain-model $PATH \
--batch-size 4 \
--use-checkpoint --nodes 4
```
### Evaluation
```bash
torchrun --nproc_per_node=1 \
main_finetune_retrieval.py \
--dataset charades_ego \
--metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
--metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
--root datasets/CharadesEgo/CharadesEgo_v1_480/ \
--output-dir $OUT_DIR \
--sparse-sample \
--pretrain-model $PATH \
--evaluate \
--resume $FINETUNED_CHECKPOINT
```
</p></details>
### EK-100 CLS
<div class="table-wrapper" markdown="block">
| | Backbone | V+N+A multi-head | Verb top-1 | Noun top-1 | Action top-1 | Pretrain (md5) | Fine-tuned checkpoint | training log |
| :----: | :-------:| :--------------: | :--------: | :--------: | :---------: | :------------: | :-------------------: | :----------: |
| LAVILA | TSF-B | no | 67.7 | 56.7 | 46.2 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0005.md5sum_d73a9c.pth) (d73a9c) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_cls/clip_openai_timesformer_base.ft_ek100_cls.single_head.ep_0100.md5sum_e8aa0c.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_cls/clip_openai_timesformer_base.ft_ek100_cls.single_head.jobid_73363.log) |
| LAVILA | TSF-B | yes | 69.0 | 58.4 | 46.9 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0005.md5sum_d73a9c.pth) (d73a9c) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_cls/clip_openai_timesformer_base.ft_ek100_cls.ep_0100.md5sum_4e3575.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_cls/clip_openai_timesformer_base.ft_ek100_cls.jobid_73361.log) |
| LAVILA | TSF-L | yes | 72.0 | 62.9 | 51.0 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth) (c89337) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_cls/clip_openai_timesformer_large.ft_ek100_cls.ep_0090.md5sum_4a2509.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ek100_cls/clip_openai_timesformer_large.ft_ek100_cls.jobid_74016.log) |
</div>
<details><summary>Training and evaluating scripts</summary>
<p>
### Multi-node training (Slurm)
```bash
# TimeSformer-Base
python run_with_submitit_finetune_classification.py \
--pretrain-model $PATH \
--use-vn-classifier --num-classes 97 300 3806 \
--use-sgd --wd 4e-5 --lr-multiplier-on-backbone 0.1 \
--use-checkpoint --nodes 1
# TimeSformer-Large
python run_with_submitit_finetune_classification.py \
--pretrain-model $PATH \
--use-vn-classifier --num-classes 97 300 3806 \
--use-sgd --wd 4e-5 --lr-multiplier-on-backbone 0.1 \
--use-checkpoint --nodes 4
```
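### Evaluation
No evaluation command is listed for EK-100 CLS; below is a minimal sketch, assuming `main_finetune_classification.py` accepts the same `--evaluate`/`--resume` pattern as the EGTEA example further down (the `--num-classes 97 300 3806` values are the verb, noun, and action class counts of EPIC-KITCHENS-100):
```bash
# hedged sketch: mirrors the training flags plus the --evaluate pattern used elsewhere in this repo
torchrun --nproc_per_node=1 \
main_finetune_classification.py \
--output-dir $OUT_DIR \
--pretrain-model $PATH \
--use-vn-classifier --num-classes 97 300 3806 \
--evaluate \
--resume $FINETUNED_CHECKPOINT \
--use-half
```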
</p></details>
### EGTEA
<div class="table-wrapper" markdown="block">
| | Backbone | mean Acc. | Pretrain (md5) | Fine-tuned checkpoint | training log |
| :----: | :-------:| :-------: | :------: | :-------------------: | :----------: |
| LAVILA | TSF-B | 70.12 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.narrator_rephraser.ep_0005.md5sum_d73a9c.pth) (d73a9c) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/egtea/clip_openai_timesformer_base.ft_egtea.ep_0090.md5sum_3b1faf.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/egtea/clip_openai_timesformer_base.ft_egtea.jobid_73358.log) |
| LAVILA | TSF-L | 76.00 | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth) (c89337) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/egtea/clip_openai_timesformer_large.ft_egtea.ep_0095.md5sum_a5ba17.pth) | [download](https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/egtea/clip_openai_timesformer_large.ft_egtea.jobid_74026.log) |
</div>
<details><summary>Training and evaluating scripts</summary>
<p>
### Multi-node training (Slurm)
```bash
# TimeSformer-Base
python run_with_submitit_finetune_classification.py \
--dataset egtea \
--metadata-train datasets/EGTEA/train_split1.txt \
--metadata-val datasets/EGTEA/test_split1.txt \
--root datasets/EGTEA/cropped_clips/ \
--pretrain-model $PATH \
--num-classes 106 \
--use-sgd --wd 4e-5 \
--use-checkpoint --nodes 1
# TimeSformer-Large
python run_with_submitit_finetune_classification.py \
--dataset egtea \
--metadata-train datasets/EGTEA/train_split1.txt \
--metadata-val datasets/EGTEA/test_split1.txt \
--root datasets/EGTEA/cropped_clips/ \
--pretrain-model $PATH \
--num-classes 106 \
--use-sgd --wd 4e-5 \
--batch-size 4 \
--use-checkpoint --nodes 4
```
### Evaluation
```bash
torchrun --nproc_per_node=1 \
main_finetune_classification.py \
--dataset egtea \
--metadata-train datasets/EGTEA/train_split1.txt \
--metadata-val datasets/EGTEA/test_split1.txt \
--root datasets/EGTEA/cropped_clips/ \
--output-dir $OUT_DIR \
--pretrain-model $PATH \
--num-classes 106 \
--use-sgd --wd 4e-5 \
--evaluate \
--resume $FINETUNED_CHECKPOINT \
--num-crops 3 --num-clips 10 \
--use-half
```
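Here `--num-crops 3 --num-clips 10` should correspond to the standard 30-view video inference (3 spatial crops × 10 temporal clips, with predictions averaged), and `--use-half` runs inference in fp16.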
</p></details>