Update README.md
# Model Card for CLIP4Clip/WebVid-150k
## Model Details

A CLIP4Clip video-text retrieval model trained on a subset of the WebVid dataset.
The model and training method are described in the paper ["CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al., and implemented in the accompanying [GitHub repository](https://github.com/ArrowLuo/CLIP4Clip).

Training on the WebVid dataset extended the model's capabilities beyond those described in the paper, thanks to the dataset's scale and diversity.
### How to use

### Extracting Text Embeddings:

```python
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

# Load the CLIP4Clip text encoder and tokenizer from the Hub:
model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# Normalize the projected text embedding for retrieval:
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output:", final_output)
```
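The normalized text embedding can then be scored against a bank of precomputed, L2-normalized video embeddings with a plain dot product. The snippet below is only an illustrative sketch: `video_embeddings` is a random placeholder standing in for embeddings produced as described in the next section, not data shipped with this model.

```python
import numpy as np

# Placeholder bank of video embeddings (illustration only); in practice these
# would be L2-normalized embeddings extracted as in "Extracting Video Embeddings".
embed_dim = final_output.shape[-1]
video_embeddings = np.random.randn(1000, embed_dim).astype(np.float32)
video_embeddings /= np.linalg.norm(video_embeddings, axis=-1, keepdims=True)

# Both sides are normalized, so cosine similarity reduces to a dot product.
scores = video_embeddings @ final_output[0]   # shape: (num_videos,)
top_k = np.argsort(-scores)[:5]               # indices of the 5 best-matching videos
print("top-5 video indices:", top_k)
```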
### Extracting Video Embeddings:

An additional notebook, [GSI_VideoRetrieval_VideoEmbedding.ipynb](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb), provides instructions on how to extract video embeddings.
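For orientation, the mean-pooling recipe from the CLIP4Clip paper can be sketched roughly as below. This is a hedged illustration, not the notebook's exact pipeline: it assumes the checkpoint can also be loaded with `CLIPImageProcessor` and `CLIPVisionModelWithProjection`, and `sample_frames` is a hypothetical helper that returns a few evenly spaced RGB frames from the video.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

frames = sample_frames("example.mp4")  # hypothetical helper: evenly spaced RGB frames

processor = CLIPImageProcessor.from_pretrained("Diangle/clip4clip-webvid")
vision_model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")

pixel_values = processor(images=frames, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    frame_embeds = vision_model(pixel_values=pixel_values).image_embeds  # (num_frames, dim)

# Mean-pool the per-frame embeddings into a single video embedding, then L2-normalize,
# mirroring the parameter-free mean-pooling similarity described in the paper.
frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
video_embed = frame_embeds.mean(dim=0)
video_embed = video_embed / video_embed.norm()
```

Refer to the linked notebook for the actual frame-sampling and preprocessing utilities.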
## Model Intended Use
This model is intended for use in large-scale video-text retrieval applications.
To illustrate its functionality, refer to the accompanying [**Video Search Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid), which provides a search demonstration over a collection of approximately 1.5 million videos.
This interactive demo showcases the model's ability to retrieve videos from text queries, highlighting its potential for handling large video datasets.
## Evaluations
To evaluate the model's performance, we used the last 10,000 video clips and their accompanying text from the WebVid dataset.
<p>[1] For overall search acceleration capabilities to boost your search application, please refer to searchium.ai</p>
</div>
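As a point of reference, retrieval metrics such as Recall@K are typically computed from a text-to-video similarity matrix along the following lines. This is a generic, illustrative sketch rather than the repository's evaluation code; the placeholder `sim` matrix assumes query *i* is paired with video *i*.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose matching video ranks in the top k.

    sim[i, j] is the similarity between text query i and video j;
    the ground-truth video for query i is assumed to be video i.
    """
    order = (-sim).argsort(axis=1)                      # best-first video indices per query
    hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.random.rand(1000, 1000)                        # placeholder similarity matrix
print({f"R@{k}": round(recall_at_k(sim, k), 4) for k in (1, 5, 10)})
```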
## Acknowledgements
Thanks to Diana Mazenko of [Searchium](https://www.searchium.ai) for adapting and uploading the model to Hugging Face, and for creating a Hugging Face [**Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid) with a large-scale video-search demo.