---
tags:
- vision
- clip
- clip4clip
- video
- retrieval
pipeline_tag: text-to-video
---

# Model Card

## Details

This model was trained via CLIP4Clip, a CLIP-based video retrieval method based on this [paper](https://arxiv.org/pdf/2104.08860.pdf) and [code](https://github.com/ArrowLuo/CLIP4Clip). It was trained on 150k videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large-scale dataset of short videos with textual descriptions sourced from the web.

We adapted the weights obtained from our training to the CLIP model implemented in [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) and made a few changes to the last layers.

### Use with Transformers

```python
import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
# With return_dict=False the model returns a tuple: (text_embeds, last_hidden_state, ...)
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

# Apply the CLIP text projection to the last hidden state, then pool at the
# end-of-text token (the highest token id in CLIP's vocabulary, hence argmax):
text_projection = model.state_dict()["text_projection.weight"]
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

# Normalize the embeddings:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)
```

## Model Use

### Intended Use

This model is intended to be used for video retrieval; see, for example, **this space**.

### Extra Information

For video embedding, there is an extra notebook that describes how to embed videos. An unofficial sketch of the same idea appears at the end of this card.

## Performance and Limitations

### Performance

We have evaluated the performance

### Limitations

## Feedback

### Where to send questions or comments about the model

Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
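## Video Embedding Sketch

The notebook referenced above is the authoritative guide; as a convenience, here is a minimal, unofficial sketch of per-frame embedding followed by mean pooling (the "meanP" aggregation from the CLIP4Clip paper). The `sample_frames` helper, the frame count, and loading the vision tower from this checkpoint are assumptions rather than code taken from the notebook; if this repository ships only text weights, the vision model can be loaded from [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) instead.

```python
import cv2
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Assumption: this checkpoint includes vision weights; otherwise load
# "openai/clip-vit-base-patch32" here instead.
model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(path, num_frames=12):
    # Hypothetical helper: uniformly sample RGB frames from a video file.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

frames = sample_frames("video.mp4")
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    # One projected embedding per sampled frame.
    image_embeds = model(pixel_values=inputs["pixel_values"]).image_embeds

# Mean-pool the per-frame embeddings into a single video embedding, then
# L2-normalize so it is comparable to the normalized text embedding above.
video_embed = image_embeds.mean(dim=0)
video_embed = video_embed / video_embed.norm()
```

Because both the video embedding and the text embedding from the snippet above are L2-normalized, their dot product gives the cosine similarity used to rank videos for a query.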