Update README.md
README.md CHANGED
@@ -10,12 +10,11 @@ pipeline_tag: text-to-video

# Model Card
## Details
-
-This model was trained on 150k videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/) (a large-scale dataset of short videos with textual descriptions sourced from the web).
-
-We adjucted the weights of the clip model we achieved from our training to the model implameted in [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) and added few changes for the last layers.
+This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [here](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
+
+The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
+
+To adapt the CLIP model obtained during training, we adjusted the weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), making certain modifications to the final layers.


### Use with Transformers

@@ -35,7 +34,7 @@ inputs = tokenizer(text=search_sentence , return_tensors="pt", padding=True)

outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

-#
+# Special projection and changing last layers:
text_projection = model.state_dict()['text_projection.weight']
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

@@ -71,8 +70,4 @@ We have evaluated the performance
## Limitations


-## Feedback
-
-### Where to send questions or comments about the model

-Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
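The adaptation step described in the new "Details" text (folding the trained CLIP weights into the `clip-vit-base-patch32` implementation, with changes to the last layers) can be pictured roughly as follows. This is a minimal sketch, not the authors' conversion script: the file name `clip4clip_text_tower.pt` and the choice of `CLIPTextModelWithProjection` are assumptions made for illustration.

```python
# Illustrative sketch only: folding externally trained CLIP text-tower weights
# into the Hugging Face clip-vit-base-patch32 implementation.
# ASSUMPTION: "clip4clip_text_tower.pt" is a hypothetical state dict whose keys
# already follow the transformers naming scheme; the real conversion is not shown
# in the diff.
import torch
from transformers import CLIPTextModelWithProjection

# Start from the reference implementation mentioned in the model card.
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# Load the weights obtained from CLIP4Clip training (hypothetical file).
trained_state = torch.load("clip4clip_text_tower.pt", map_location="cpu")

# Overwrite matching parameters; strict=False tolerates the keys that differ in
# the "modifications to the final layers" the card mentions.
missing, unexpected = model.load_state_dict(trained_state, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)

model.save_pretrained("./clip4clip-text-encoder")
```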
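For context, the "Use with Transformers" snippet touched by the second hunk is only a fragment. A self-contained version might look like the sketch below; the checkpoint id is a placeholder, the `CLIPTextModelWithProjection`/`CLIPTokenizer` classes and the trailing L2 normalization are assumptions, and the projection and token-selection lines mirror the diff.

```python
# Self-contained sketch of the snippet touched by the diff.
# ASSUMPTIONS: "<model-repo-id>" is a placeholder for the checkpoint this card
# describes; the normalization at the end is a common convention, not shown in
# the hunk.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

model = CLIPTextModelWithProjection.from_pretrained("<model-repo-id>")
tokenizer = CLIPTokenizer.from_pretrained("<model-repo-id>")

search_sentence = "a basketball player performing a slam dunk"
inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    return_dict=False)

# Special projection and changing last layers (as in the diff):
# outputs[1] is the last hidden state [batch, seq_len, dim]; project it, then keep
# the embedding at the end-of-text token (the highest token id in each sequence).
text_projection = model.state_dict()["text_projection.weight"]
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

# Assumed convention: L2-normalize before computing similarities.
sentence_embedding = final_output / final_output.norm(dim=-1, keepdim=True)
print(sentence_embedding.shape)  # torch.Size([1, 512])
```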
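Finally, since the card's pipeline tag is text-to-video, the sentence embedding above is presumably meant to be scored against precomputed video embeddings. The ranking helper below is a hypothetical illustration of that retrieval step, not code from the card; `video_embeddings` stands in for a bank of per-video CLIP4Clip embeddings.

```python
# Hypothetical ranking step: compare one normalized sentence embedding against a
# bank of precomputed, normalized video embeddings and return the best matches.
import torch

def rank_videos(sentence_embedding: torch.Tensor,
                video_embeddings: torch.Tensor,
                top_k: int = 5):
    """sentence_embedding: [1, dim]; video_embeddings: [num_videos, dim], both L2-normalized."""
    similarities = sentence_embedding @ video_embeddings.T  # [1, num_videos]
    scores, indices = similarities.squeeze(0).topk(top_k)
    return list(zip(indices.tolist(), scores.tolist()))

# Toy example with random "video" embeddings (placeholders, not real data).
video_embeddings = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
sentence_embedding = torch.nn.functional.normalize(torch.randn(1, 512), dim=-1)
print(rank_videos(sentence_embedding, video_embeddings))
```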