Diangle committed on
Commit
66b35be
1 Parent(s): 7c9c3f9

Update README.md

Files changed (1): README.md +29 -26
README.md CHANGED
@@ -11,6 +11,7 @@ pipeline_tag: text-to-video
 
 # Model Card for CLIP4Clip/WebVid-150k
 ## Model Details
+
 A CLIP4Clip video-text retrieval model trained on a subset of the WebVid dataset.
 The model and training method are described in the paper ["CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al., and implemented in the accompanying [GitHub repository](https://github.com/ArrowLuo/CLIP4Clip).
 
@@ -27,6 +28,33 @@ visual-temporal concepts from videos, thereby improving video-based searches.
 By using the WebVid dataset, the model's capabilities were extended beyond those described in the paper, thanks to the dataset's large scale and diversity.
 
 
+ ### How to use
+ ### Extracting Text Embeddings:
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import CLIPTokenizer, CLIPTextModelWithProjection
+
+ search_sentence = "a basketball player performing a slam dunk"
+
+ model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
+ tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")
+
+ inputs = tokenizer(text=search_sentence, return_tensors="pt")
+ outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
+
+ # Normalize the embedding for retrieval:
+ final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
+ final_output = final_output.cpu().detach().numpy()
+ print("final_output:", final_output)
+ ```
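+ Since `final_output` is unit-normalized, it can be scored against a collection of likewise-normalized video embeddings with a plain dot product, which then equals cosine similarity; higher scores indicate better matches.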
+
+ ### Extracting Video Embeddings:
+
+ An additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb) provides instructions and utility functions for extracting video embeddings.
+
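+ For orientation, the sketch below shows the common CLIP4Clip mean-pooling recipe: embed sampled frames with the vision tower, normalize, and average. Treat it as a minimal sketch only; the use of `CLIPVisionModelWithProjection` and `CLIPImageProcessor` with this checkpoint, and the placeholder frames, are assumptions, and the notebook above is the authoritative reference.
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
+
+ # Assumption: the checkpoint also serves the vision tower and its preprocessing config.
+ model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
+ processor = CLIPImageProcessor.from_pretrained("Diangle/clip4clip-webvid")
+
+ # `frames` stands in for uniformly sampled RGB frames of one video
+ # (e.g. HxWx3 uint8 arrays from your video decoder).
+ frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(4)]
+
+ inputs = processor(images=frames, return_tensors="pt")
+ with torch.no_grad():
+     frame_embeds = model(pixel_values=inputs["pixel_values"]).image_embeds
+
+ # Normalize per frame, mean-pool across frames, then renormalize so the
+ # video embedding is directly comparable to the normalized text embedding.
+ frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
+ video_embed = frame_embeds.mean(dim=0)
+ video_embed = video_embed / video_embed.norm()
+ print(video_embed.shape)
+ ```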
 ## Model Intended Use
 
 This model is intended for use in large-scale video-text retrieval applications.
@@ -34,6 +62,7 @@ This model is intended for use in large scale video-text retrieval applications.
 To illustrate its functionality, refer to the accompanying [**Video Search Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid), which demonstrates search over a collection of approximately 1.5 million videos.
 This interactive demo showcases the model's ability to retrieve videos from text queries, highlighting its potential for handling large video datasets.
 
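+ At that scale, a text query is typically served by one matrix product against a precomputed matrix of video embeddings followed by a top-k selection. The helper below is an illustrative sketch under assumed shapes, not the demo's actual serving code:
+
+ ```python
+ import numpy as np
+
+ def top_k_videos(query_embed: np.ndarray, video_index: np.ndarray, k: int = 10) -> np.ndarray:
+     """Return indices of the k best-matching videos.
+
+     query_embed: (D,) unit-normalized text embedding.
+     video_index: (N, D) unit-normalized video embeddings.
+     """
+     scores = video_index @ query_embed        # cosine similarities, shape (N,)
+     top_k = np.argpartition(-scores, k)[:k]   # k best candidates, unordered
+     return top_k[np.argsort(-scores[top_k])]  # sorted best-first
+ ```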
+
 ## Evaluations
 
 To evaluate the model's performance, we used the last 10,000 video clips and their accompanying text from the WebVid dataset.
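+ As a rough illustration of how such a retrieval evaluation can be computed, the sketch below scores every caption against every video and reads off R@k and median rank from the similarity matrix. The embedding dimension and the random placeholder data are assumptions for illustration; refer to the evaluation notebook for the actual procedure.
+
+ ```python
+ import numpy as np
+
+ # Placeholder stand-ins for the real pairs: row i of each matrix should hold
+ # the embeddings of the i-th caption and its ground-truth video clip.
+ N, D = 1000, 512  # illustrative sizes; the evaluation used 10,000 clips
+ rng = np.random.default_rng(0)
+ text_embeds = rng.normal(size=(N, D))
+ text_embeds /= np.linalg.norm(text_embeds, axis=1, keepdims=True)
+ video_embeds = rng.normal(size=(N, D))
+ video_embeds /= np.linalg.norm(video_embeds, axis=1, keepdims=True)
+
+ sim = text_embeds @ video_embeds.T                     # (N, N) cosine similarities
+ order = (-sim).argsort(axis=1)                         # videos ranked best-first per query
+ gt_rank = np.where(order == np.arange(N)[:, None])[1]  # rank of the true video (0-based)
+
+ for k in (1, 5, 10):
+     print(f"R@{k}: {np.mean(gt_rank < k):.4f}")
+ print("Median rank:", np.median(gt_rank) + 1)
+ ```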
@@ -58,32 +87,6 @@ For an elaborate description of the evaluation refer to the notebook
 <p>[1] For overall search-acceleration capabilities to boost your search application, please refer to searchium.ai</p>
 </div>
 
- ### How to use
- ### Extracting Text Embeddings:
-
- ```python
- import numpy as np
- import torch
- from transformers import CLIPTokenizer, CLIPTextModelWithProjection
-
- search_sentence = "a basketball player performing a slam dunk"
-
- model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
- tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")
-
- inputs = tokenizer(text=search_sentence, return_tensors="pt")
- outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
-
- # Normalize the embedding for retrieval:
- final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
- final_output = final_output.cpu().detach().numpy()
- print("final_output:", final_output)
- ```
-
- ### Extracting Video Embeddings:
-
- Because extracting video embeddings is moderately involved, example usage with utility functions is provided in the additional notebook [GSI_VideoRetrieval_VideoEmbedding.ipynb](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb).
 
 ## Acknowledgements
 Thanks to Diana Mazenko of [Searchium](https://www.searchium.ai) for adapting and loading the model to Hugging Face, and for creating a Hugging Face [**SPACE**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid) for a large-scale video-search demo.
 