tags:
- text
- vision
- video
datasets:
- HuggingFaceM4/webvid
pipeline_tag: text-to-video
Model Card
Details
This model was trained with CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval by Luo et al., and implemented in the accompanying code.
The training process involved 150,000 videos obtained from the WebVid Dataset, a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
To make the trained CLIP4Clip model loadable with the clip-vit-base-patch32 implementation, we modified its weights.
Use with Transformers
Extracting Text Embeddings:
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection
search_sentence = "a basketball player performing a slam dunk"
model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")
inputs = tokenizer(text=search_sentence, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
# Normalizing the embeddings:
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output: ", final_output)
Extracting Video Embeddings:
A separate notebook shows how to extract video embeddings.
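As a rough sketch of how per-frame embeddings could be combined into a single video embedding, the snippet below assumes the parameter-free mean-pooling variant described in the CLIP4Clip paper: each frame embedding is L2-normalized, the frames are averaged, and the result is normalized again. The helper name `pool_frame_embeddings` and the random stand-in data are illustrative assumptions. In practice the frame embeddings would come from a vision model (for example `CLIPVisionModelWithProjection`), and the notebook remains the reference for frame sampling and preprocessing.

```python
import numpy as np


def pool_frame_embeddings(frame_embeds: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame embeddings of shape (num_frames, dim) into one unit vector."""
    # Normalize each frame embedding to unit length.
    frame_embeds = frame_embeds / np.linalg.norm(frame_embeds, axis=-1, keepdims=True)
    # Average over the frame axis.
    video_embed = frame_embeds.mean(axis=0)
    # Re-normalize so the result can be compared with text embeddings by dot product.
    return video_embed / np.linalg.norm(video_embed)


# Stand-in for real per-frame embeddings (hypothetical: 12 frames, 512 dimensions).
rng = np.random.default_rng(0)
frame_embeds = rng.normal(size=(12, 512))
video_embed = pool_frame_embeddings(frame_embeds)
print(video_embed.shape)
```

Normalizing before and after pooling keeps every frame's contribution on the same scale and leaves the video embedding directly comparable to a normalized text embedding.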
Model Intended Use
This model is intended for video retrieval; for an example, see this Space.
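A minimal sketch of the retrieval step, assuming that text and video embeddings are both L2-normalized so that a dot product equals cosine similarity. The function name `top_k_videos` and the toy 4-dimensional vectors are hypothetical and only illustrate the ranking logic.

```python
import numpy as np


def top_k_videos(text_embed: np.ndarray, video_embeds: np.ndarray, k: int = 5):
    """Return indices and scores of the k videos most similar to the text query.

    Both inputs are assumed to be L2-normalized.
    """
    scores = video_embeds @ text_embed  # shape: (num_videos,)
    order = np.argsort(-scores)[:k]     # highest score first
    return order, scores[order]


# Toy data: three hypothetical 4-dimensional video embeddings.
video_embeds = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])
text_embed = np.array([0.0, 1.0, 0.0, 0.0])

indices, scores = top_k_videos(text_embed, video_embeds, k=2)
print(indices, scores)
```

With the text-embedding snippet from this card, `text_embed` would correspond to `final_output[0]`, since the model returns one row per input sentence.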
Performance
We evaluated several models on the last 10k video clips of the WebVid dataset.
Model | R1 | R5 | R10 | MedianR | MeanR |
---|---|---|---|---|---|
Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
CLIP4Clip trained on 150k WebVid | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |
For more information about the evaluation, see this notebook.
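The column names in the table can be read with the usual text-to-video retrieval definitions, which are assumed here: R1, R5, and R10 are the percentage of queries whose correct video appears in the top 1, 5, or 10 results, while MedianR and MeanR are the median and mean rank of the correct video. The sketch below computes them from a list of 1-based ranks; the function name `retrieval_metrics` and the sample ranks are hypothetical, and the evaluation notebook remains the authoritative description of how the reported numbers were produced.

```python
from statistics import mean, median


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute R@K (in percent), median rank, and mean rank from 1-based ranks."""
    metrics = {f"R{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MedianR"] = float(median(ranks))
    metrics["MeanR"] = float(mean(ranks))
    return metrics


# Hypothetical ranks of the correct video for five text queries.
ranks = [1, 3, 1, 12, 7]
print(retrieval_metrics(ranks))
```

Because MeanR averages over all queries, a few badly ranked queries can raise it sharply even when the recall columns barely change.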