---
tags:
- text
- vision
- video
datasets:
- HuggingFaceM4/webvid
pipeline_tag: text-to-video
---

# Model Card

## Details

This model was trained with CLIP4Clip, a video retrieval method built on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al. and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

Training used 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large collection of short videos with corresponding textual descriptions sourced from the web.

To make the trained CLIP model loadable with the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, we adapted the weights accordingly.

### Use with Transformers

### Extracting Text Embeddings:

```python
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# outputs[0] is the projected text embedding; normalize it to unit length.
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output:", final_output)
```

### Extracting Video Embeddings:

An additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb) provides instructions on extracting video embeddings; an illustrative sketch is also included at the end of this card.

## Model Intended Use

This model is intended to be used for text-to-video retrieval; see, for example, this [**Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid). A minimal retrieval sketch is also given at the end of this card.

## Performance

We evaluated several models on the last 10,000 video clips of the WebVid dataset.

| Model | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|
| Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| **CLIP4Clip trained on 150k WebVid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |

For more information about the evaluation, see this [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
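## Usage Sketches

### Extracting a Video Embedding (sketch)

For convenience, here is a minimal, illustrative sketch of video embedding extraction; the notebook linked above remains the authoritative reference. The sketch assumes that the vision weights of this checkpoint load with `CLIPVisionModelWithProjection`, that image preprocessing matches the base `openai/clip-vit-base-patch32` checkpoint, and that frame embeddings are mean-pooled into a single video embedding, as in CLIP4Clip's mean-pooling variant. The `sample_frames` helper and the `example.mp4` path are hypothetical.

```python
import cv2
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Illustrative sketch only; see the video-embedding notebook for the exact procedure.

def sample_frames(path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file (hypothetical helper)."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
# Assumption: preprocessing matches the base CLIP checkpoint this model was adapted to.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = sample_frames("example.mp4")  # hypothetical input video
inputs = processor(images=frames, return_tensors="pt")

with torch.no_grad():
    outputs = model(pixel_values=inputs["pixel_values"])

# outputs[0] holds one projected embedding per frame; normalize, mean-pool, renormalize.
frame_embeds = outputs[0]
frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
video_embed = frame_embeds.mean(dim=0)
video_embed = video_embed / video_embed.norm()
video_embed = video_embed.cpu().numpy()
print("video_embed shape:", video_embed.shape)
```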
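### Ranking Videos for a Text Query (sketch)

Once text and video embeddings are available, retrieval reduces to cosine similarity, which here is just a dot product because both embeddings are L2-normalized. The `video_embeddings` matrix and `video_ids` list below are hypothetical placeholders for embeddings precomputed with the sketch above; `final_output` refers to the normalized text embedding from the text-embedding example.

```python
import numpy as np

# Hypothetical precomputed, L2-normalized video embeddings (one row per video) and their ids.
video_embeddings = np.random.randn(1000, 512).astype(np.float32)
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)
video_ids = [f"video_{i}" for i in range(len(video_embeddings))]

def rank_videos(text_embedding, video_embeddings, video_ids, top_k=5):
    """Rank videos by cosine similarity (dot product of unit-norm embeddings)."""
    scores = video_embeddings @ np.asarray(text_embedding).ravel()
    order = np.argsort(-scores)[:top_k]
    return [(video_ids[i], float(scores[i])) for i in order]

# Example usage with the text embedding computed earlier:
# results = rank_videos(final_output, video_embeddings, video_ids)
# for vid, score in results:
#     print(vid, score)
```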