---
tags:
- vision
- clip
- clip4clip
- video
- retrieval
pipeline_tag: text-to-video
---
# Model Card
## Details
This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [here](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
To adapt the CLIP model obtained from training, we converted its weights and integrated them into the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, with some modifications to the final layers.
### Use with Transformers
```python
import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection
search_sentence = "a basketball player performing a slam dunk"
model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")
inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)
# Apply the text projection to the hidden states and take the embedding at the EOS token position (the modified final layers):
text_projection = model.state_dict()['text_projection.weight']
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]
# Normalizing the embeddings:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)
```
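The normalized text embedding can then be matched against video embeddings by inner product. Below is a minimal sketch that continues from the snippet above; the video embedding matrix is a hypothetical placeholder (random values) standing in for embeddings you would precompute over your own video collection.

```python
# Hypothetical matrix of precomputed video embeddings (num_videos x 512),
# L2-normalized in the same way as the text embedding above.
video_embeddings = np.random.randn(1000, 512).astype(np.float32)
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)

# For normalized vectors, cosine similarity is a plain dot product.
similarities = video_embeddings @ sequence_output[0]

# Indices of the 5 videos that best match the query sentence.
top_k = np.argsort(-similarities)[:5]
print("top-5 video indices:", top_k)
```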
## Model Use
### Intended Use
This model is intended to be used for video retrieval; see for example **this space**.
### Extra Information
For video embedding, there is an additional notebook that describes how to embed videos.
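As a rough illustration of that workflow (not the notebook itself), one common CLIP4Clip-style recipe is to embed uniformly sampled frames with a CLIP vision encoder and mean-pool them (the parameter-free "meanP" similarity in the paper). The sketch below uses the vision tower of openai/clip-vit-base-patch32 as a stand-in for whatever vision weights the notebook actually loads, and dummy arrays in place of decoded video frames.

```python
import numpy as np
import torch
from transformers import AutoProcessor, CLIPVisionModelWithProjection

# Stand-in vision encoder; the notebook may load different (fine-tuned) weights.
vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# `frames` stands for uniformly sampled RGB frames of a single video,
# e.g. HxWx3 uint8 arrays produced by your video decoder (dummy frames here).
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(8)]

inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    frame_embeds = vision_model(**inputs).image_embeds  # (num_frames, 512)

# Mean-pool the frame embeddings into one video embedding, then normalize.
video_embed = frame_embeds.mean(dim=0)
video_embed = video_embed / video_embed.norm()
print("video embedding shape:", tuple(video_embed.shape))
```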
## Performance and Limitations
### Performance
We have evaluated the model's performance on text-to-video retrieval.
### Limitations