metadata

tags:
  - vision
  - clip
  - clip4clip
  - video
  - retrieval
pipeline_tag: text-to-video

Model Card

Details

This model was trained with CLIP4Clip, a video retrieval method built on the CLIP framework, as described in the CLIP4Clip paper and implemented in its accompanying code.

The training process involved 150,000 videos obtained from the WebVid Dataset, a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.

To adapt the CLIP model obtained during training, we adjusted its weights and integrated them into the clip-vit-base-patch32 implementation, with some modifications to the final layers.

Use with Transformers

import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection


search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

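# Special projection and changed last layers:
# apply the projection weights to the last hidden state (outputs[1]),
# then keep the embedding at the end-of-text token (the highest token id in each sequence).
text_projection = model.state_dict()['text_projection.weight']
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]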

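# Normalize the embeddings to unit L2 norm:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
# The embeddings already have unit norm here, so the sum of squares is 1
# and this division leaves them unchanged (up to floating-point error).
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)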
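
Once a text embedding is available, retrieval amounts to comparing it against precomputed video embeddings. The following is a minimal sketch, assuming both sides are L2-normalized vectors of the same dimension. The function `rank_videos` and the random stand-in data are hypothetical and are not part of this model's code; in practice the text vector would be `sequence_output[0]` from the snippet above.

```python
import numpy as np


def rank_videos(text_embedding, video_embeddings, top_k=5):
    """Return indices and scores of the top_k videos most similar to the text.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = video_embeddings @ text_embedding   # shape: (num_videos,)
    order = np.argsort(-scores)[:top_k]          # highest score first
    return order, scores[order]


# Hypothetical stand-in data: 4 random "video" embeddings of dimension 512.
rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(4, 512))
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)

# Pretend the query is a slightly noisy copy of video 2.
text_embedding = video_embeddings[2] + 0.01 * rng.normal(size=512)
text_embedding /= np.linalg.norm(text_embedding)

indices, scores = rank_videos(text_embedding, video_embeddings, top_k=3)
print(indices, scores)
```

Because the vectors are normalized, a plain matrix-vector product is enough; no separate cosine computation is needed.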

Model Use

Intended Use

This model is intended for video retrieval; see, for example, this space.

Extra Information

For video embedding, there is an additional notebook that describes how to embed videos.
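
The notebook is not reproduced here. As a rough illustration of one common approach, the sketch below uses the parameter-free mean pooling described in the CLIP4Clip paper: each sampled frame is encoded separately, and the frame embeddings are averaged into a single video vector. Whether the notebook does exactly this is an assumption. The function `pool_frame_embeddings` and the random frame data are hypothetical placeholders for real image-encoder outputs.

```python
import numpy as np


def pool_frame_embeddings(frame_embeddings):
    """Collapse per-frame embeddings of shape (num_frames, dim) into one unit vector."""
    frames = np.asarray(frame_embeddings, dtype=np.float64)
    # Normalize each frame so every frame contributes equally to the mean.
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    video = frames.mean(axis=0)
    # Re-normalize so the result can be compared by dot product.
    return video / np.linalg.norm(video)


# Hypothetical input: 8 sampled frames, each already encoded to 512 dimensions.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(8, 512))
video_embedding = pool_frame_embeddings(frame_embeddings)
print(video_embedding.shape)
```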

Performance and Limitations

Performance

We evaluated the performance of different models on the last 10k video clips of the WebVid dataset.

Model	R@1	R@5	R@10	Median Rank	Mean Rank
Zero-shot CLIP weights	37.16	62.10	71.16	3.0	42.2128
CLIP4Clip weights trained on MSR-VTT	38.38	62.89	72.01	3.0	39.3023
CLIP4Clip trained on 150k WebVid (this model)	50.74	77.30	85.05	1.0	14.9535
Binarized CLIP4Clip trained on 150k WebVid with rerank100	50.56	76.39	83.51	1.0	43.2964
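
For readers unfamiliar with the recall and rank columns in the table above, here is a minimal sketch of how such metrics are conventionally computed from a text-to-video similarity matrix. It assumes that text i is paired with video i, that recall is reported as a percentage, and that ranks are 1-based. The function `retrieval_metrics` and the toy matrix are hypothetical and are not taken from the evaluation notebook.

```python
import numpy as np


def retrieval_metrics(sim_matrix):
    """Compute text-to-video retrieval metrics.

    sim_matrix[i, j] is the similarity between text i and video j;
    the correct video for text i is assumed to be video i.
    """
    sim = np.asarray(sim_matrix, dtype=np.float64)
    correct = np.diag(sim)[:, None]
    # Rank of the correct video: 1 + number of videos scored strictly higher.
    ranks = 1 + np.sum(sim > correct, axis=1)
    return {
        "recall_at_1": float(100.0 * np.mean(ranks <= 1)),
        "recall_at_5": float(100.0 * np.mean(ranks <= 5)),
        "recall_at_10": float(100.0 * np.mean(ranks <= 10)),
        "median_rank": float(np.median(ranks)),
        "mean_rank": float(np.mean(ranks)),
    }


# Toy example with 3 texts and 3 videos (values are made up).
toy_sim = np.array([
    [0.9, 0.1, 0.2],   # text 0: correct video ranked 1st
    [0.8, 0.5, 0.3],   # text 1: correct video ranked 2nd
    [0.1, 0.2, 0.7],   # text 2: correct video ranked 1st
])
metrics = retrieval_metrics(toy_sim)
print(metrics)
```

Higher recall is better, while lower median and mean ranks are better.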

For more information about the evaluation, see this notebook.
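
The last row of the table mentions binarized embeddings with "rerank100", which this card does not define. One plausible reading, stated here purely as an assumption, is a two-stage search: shortlist 100 candidates by Hamming distance on sign-binarized embeddings, then reorder that shortlist with the full-precision embeddings. The sketch below illustrates that general technique; `binarize`, `search_with_rerank`, and the random data are hypothetical and may differ from what the authors did.

```python
import numpy as np


def binarize(embeddings):
    """Map each float embedding to a vector of 0/1 bits by its sign."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)


def search_with_rerank(text_emb, video_embs, shortlist_size=100):
    """Shortlist by Hamming distance on binary codes, then rerank with floats."""
    text_bits = binarize(text_emb)
    video_bits = binarize(video_embs)
    # Hamming distance: number of bit positions that differ.
    hamming = np.count_nonzero(video_bits != text_bits, axis=1)
    shortlist = np.argsort(hamming, kind="stable")[:shortlist_size]
    # Rerank only the shortlist using the full-precision dot product.
    scores = video_embs[shortlist] @ text_emb
    return shortlist[np.argsort(-scores)]


# Hypothetical stand-in data: 1000 random unit-norm video embeddings.
rng = np.random.default_rng(0)
video_embs = rng.normal(size=(1000, 512))
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)

# Pretend the query is a slightly noisy copy of video 42.
text_emb = video_embs[42] + 0.01 * rng.normal(size=512)
text_emb /= np.linalg.norm(text_emb)

ranked = search_with_rerank(text_emb, video_embs, shortlist_size=100)
print(ranked[:5])
```

Under this reading, a correct video that misses the shortlist is never recovered by reranking, which would keep the top-k recall close to the float model while letting the mean rank grow.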