Diangle committed
Commit 7840dd2
1 Parent(s): 468f4bb

Update README.md

Files changed (1)
  1. README.md +4 -9
README.md CHANGED
@@ -10,12 +10,11 @@ pipeline_tag: text-to-video
 
 # Model Card
 ## Details
 
-This model was trained via CLIP4Clip (a CLIP-based video retrival method, based on this [paper](https://arxiv.org/pdf/2104.08860.pdf) and [code](https://github.com/ArrowLuo/CLIP4Clip).
-This model was trained on 150k videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/) (a large-scale dataset of short videos with textual descriptions sourced from the web).
-
-We adjucted the weights of the clip model we achieved from our training to the model implameted in [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) and added few changes for the last layers.
+This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [here](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
+
+The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
+
+To adapt the clip model obtained during training, we adjusted the weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), making certain modifications to the final layers.
 
 
 ### Use with Transformers
 
@@ -35,7 +34,7 @@ inputs = tokenizer(text=search_sentence , return_tensors="pt", padding=True)
 
 outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)
 
-# Adding special projection and changing last layers:
+# Special projection and changing last layers:
 text_projection = model.state_dict()['text_projection.weight']
 text_embeds = outputs[1] @ text_projection
 final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]
@@ -71,8 +70,4 @@ We have evaluated the performance
 ## Limitations
 
 
-## Feedback
-
-### Where to send questions or comments about the model
 
-Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
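
For context, below is a minimal, self-contained sketch of the text-encoding path that the README snippet in the diff walks through. The diff does not show the repo id of the fine-tuned CLIP4Clip weights, so the base [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) checkpoint is used as a stand-in; the example query string and the use of `CLIPTextModelWithProjection` (which applies the text projection internally) are likewise assumptions for illustration, not the model card's exact recipe.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# NOTE: placeholder checkpoint; the fine-tuned CLIP4Clip text-encoder weights
# from this repository would be loaded the same way once the repo id is known.
checkpoint = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
model = CLIPTextModelWithProjection.from_pretrained(checkpoint)

search_sentence = "a basketball player performing a slam dunk"  # example query
inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])

# CLIPTextModelWithProjection applies the text projection to the end-of-text
# token state, which is conceptually the quantity the README snippet assembles
# by hand with `text_projection` and the argmax indexing.
text_embeds = outputs.text_embeds
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)  # L2-normalize
print(text_embeds.shape)  # torch.Size([1, 512])
```

The normalized embedding can then be compared (e.g. by cosine similarity) against video embeddings produced by the visual side of the model.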