---
tags:
- text
- vision
- video
datasets:
- HuggingFaceM4/webvid
pipeline_tag: text-to-video
---

# Model Card

## Details

This model was trained with CLIP4Clip, a video retrieval method built on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al. and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

Training used 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large collection of short videos with corresponding textual descriptions sourced from the web.

To make the trained CLIP model loadable with the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, we adapted the weights accordingly.

### Use with Transformers

### Extracting Text Embeddings:

```python
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# outputs[0] is the projected text embedding; normalize it to unit length.
final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
print("final_output:", final_output)
```

### Extracting Video Embeddings:

An additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb) provides instructions on extracting video embeddings; an illustrative sketch is also included at the end of this card.

## Model Intended Use

This model is intended to be used for text-to-video retrieval; see, for example, this [**Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid). A minimal retrieval sketch is also given at the end of this card.

## Performance

We evaluated several models on the last 10,000 video clips of the WebVid dataset.

| Model | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|
| Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| **CLIP4Clip trained on 150k WebVid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |

For more information about the evaluation, see this [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
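## Usage Sketches

### Extracting a Video Embedding (sketch)

For convenience, here is a minimal, illustrative sketch of video embedding extraction; the notebook linked above remains the authoritative reference. The sketch assumes that the vision weights of this checkpoint load with `CLIPVisionModelWithProjection`, that image preprocessing matches the base `openai/clip-vit-base-patch32` checkpoint, and that frame embeddings are mean-pooled into a single video embedding, as in CLIP4Clip's mean-pooling variant. The `sample_frames` helper and the `example.mp4` path are hypothetical.

```python
import cv2
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Illustrative sketch only; see the video-embedding notebook for the exact procedure.

def sample_frames(path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file (hypothetical helper)."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

model = CLIPVisionModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
# Assumption: preprocessing matches the base CLIP checkpoint this model was adapted to.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = sample_frames("example.mp4")  # hypothetical input video
inputs = processor(images=frames, return_tensors="pt")

with torch.no_grad():
    outputs = model(pixel_values=inputs["pixel_values"])

# outputs[0] holds one projected embedding per frame; normalize, mean-pool, renormalize.
frame_embeds = outputs[0]
frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
video_embed = frame_embeds.mean(dim=0)
video_embed = video_embed / video_embed.norm()
video_embed = video_embed.cpu().numpy()
print("video_embed shape:", video_embed.shape)
```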
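### Ranking Videos for a Text Query (sketch)

Once text and video embeddings are available, retrieval reduces to cosine similarity, which here is just a dot product because both embeddings are L2-normalized. The `video_embeddings` matrix and `video_ids` list below are hypothetical placeholders for embeddings precomputed with the sketch above; `final_output` refers to the normalized text embedding from the text-embedding example.

```python
import numpy as np

# Hypothetical precomputed, L2-normalized video embeddings (one row per video) and their ids.
video_embeddings = np.random.randn(1000, 512).astype(np.float32)
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)
video_ids = [f"video_{i}" for i in range(len(video_embeddings))]

def rank_videos(text_embedding, video_embeddings, video_ids, top_k=5):
    """Rank videos by cosine similarity (dot product of unit-norm embeddings)."""
    scores = video_embeddings @ np.asarray(text_embedding).ravel()
    order = np.argsort(-scores)[:top_k]
    return [(video_ids[i], float(scores[i])) for i in order]

# Example usage with the text embedding computed earlier:
# results = rank_videos(final_output, video_embeddings, video_ids)
# for vid, score in results:
#     print(vid, score)
```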