---
tags:
- vision
- clip
- clip4clip
- video
- retrieval
pipeline_tag: text-to-video
---

# Model Card

## Details

This model was trained using CLIP4Clip, a video retrieval method built on the CLIP framework, as described in [the paper](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

The model was trained on 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large-scale collection of short videos with corresponding textual descriptions sourced from the web.

To adapt the CLIP model obtained from this training, we adjusted its weights and integrated them into the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, making some modifications to the final layers.

### Use with Transformers

```python
import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

# Special projection replacing the final layers: project the last hidden state
# and pool at the EOS position (the token with the highest id in each sequence).
text_projection = model.state_dict()['text_projection.weight']
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

# Normalize the embeddings:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)
```

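The snippet above produces `sequence_output`, a single unit-normalized text embedding (512-dimensional for the clip-vit-base-patch32 architecture used here) that can be compared against video embeddings with a simple dot product.
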
## Model Use

### Intended Use

This model is intended to be used for video retrieval; see, for example, **this space**.
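As an illustration of how the text embedding can be used for retrieval, the sketch below ranks a set of precomputed video embeddings by similarity to the query. It is a minimal example, not part of this repository: `video_embeddings.npy` is a hypothetical file of unit-normalized video embeddings (for instance produced as described in the notebook mentioned below), and `sequence_output` is the text embedding computed in the snippet above.

```python
import numpy as np

# Hypothetical file of precomputed, unit-normalized video embeddings, shape (N, 512).
video_embeddings = np.load("video_embeddings.npy")

# `sequence_output` is the (1, 512) text embedding from the snippet above.
# Since both sides are unit-normalized, cosine similarity reduces to a dot product.
similarities = video_embeddings @ sequence_output[0]

# Indices of the 5 videos most similar to the query sentence.
top_k = np.argsort(-similarities)[:5]
print("top matches:", top_k, similarities[top_k])
```
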

### Extra Information

For video embedding, there is an additional notebook that describes how to embed videos; a rough sketch of the idea is shown below.
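The notebook is the authoritative reference for video embedding. Purely as an illustration of the general approach (sample frames, embed each frame with a CLIP vision encoder, mean-pool the frame embeddings as in CLIP4Clip's parameter-free similarity head), here is a minimal sketch; the vision checkpoint used below (`openai/clip-vit-base-patch32`) and the `embed_video` helper are placeholder assumptions, not part of this repository.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Placeholder vision weights for illustration only; the notebook describes the
# actual encoder and preprocessing to use with this model.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(frames):
    """Embed a video given `frames`, a list of PIL images sampled uniformly from it."""
    pixel_values = processor(images=frames, return_tensors="pt")["pixel_values"]
    with torch.no_grad():
        frame_embeds = vision_model(pixel_values=pixel_values).image_embeds  # (num_frames, 512)
    # Mean-pool the frame embeddings and normalize so retrieval is a dot product.
    video_embed = frame_embeds.mean(dim=0)
    return video_embed / video_embed.norm()
```
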

## Performance and Limitations

### Performance

We have evaluated the performance of the model.

## Limitations