metadata
license: ecl-2.0
tags:
  - referring-video-object-segmentation

VD-IT model

This is the pre-trained checkpoint for our paper Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation.

We use a video diffusion model (ModelScopeT2V) as our base model and apply prompt tuning to adapt it as a visual backbone for downstream video understanding tasks.

Model training

We first pre-train our model on Ref-COCO and then fine-tune it on Ref-YouTube-VOS. Training uses two NVIDIA A100 GPUs, with 5 frames per clip, and runs for 9 epochs. The initial learning rate is 5e-5 and is reduced by a factor of 10 at the 6th and 8th epochs.
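The step schedule described above can be sketched in a few lines of plain Python. This is only an illustration, not the training code from the paper: the function name `learning_rate` and its parameters are hypothetical, and it assumes epochs are counted from 1 and that each reduction takes effect at the start of the named epoch.

```python
def learning_rate(epoch, base_lr=5e-5, milestones=(6, 8), gamma=0.1):
    """Return the learning rate for a 1-indexed epoch.

    Assumption: the rate is multiplied by `gamma` at the start of each
    milestone epoch (6 and 8), so epochs 1-5 use the initial rate.
    """
    # Count how many milestones have been reached by this epoch.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Print the schedule for the 9 training epochs.
for epoch in range(1, 10):
    print(f"epoch {epoch}: lr = {learning_rate(epoch):.1e}")
```

Under these assumptions, epochs 1 to 5 run at the initial rate, epochs 6 and 7 at one tenth of it, and epochs 8 and 9 at one hundredth. If the reductions are instead applied after the 6th and 8th epochs finish, shift the milestones by one.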