Diangle committed
Commit 7840dd2
1 Parent(s): 468f4bb

Update README.md

Files changed (1)
  1. README.md +4 -9
README.md CHANGED
@@ -10,12 +10,11 @@ pipeline_tag: text-to-video
 
 # Model Card
 ## Details
 
-This model was trained via CLIP4Clip (a CLIP-based video retrival method, based on this [paper](https://arxiv.org/pdf/2104.08860.pdf) and [code](https://github.com/ArrowLuo/CLIP4Clip).
-This model was trained on 150k videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/) (a large-scale dataset of short videos with textual descriptions sourced from the web).
-
-We adjucted the weights of the clip model we achieved from our training to the model implameted in [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) and added few changes for the last layers.
+This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [here](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
+
+The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
+
+To adapt the clip model obtained during training, we adjusted the weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), making certain modifications to the final layers.
 
 
 ### Use with Transformers
 
@@ -35,7 +34,7 @@ inputs = tokenizer(text=search_sentence , return_tensors="pt", padding=True)
 
 outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)
 
-# Adding special projection and changing last layers:
+# Special projection and changing last layers:
 text_projection = model.state_dict()['text_projection.weight']
 text_embeds = outputs[1] @ text_projection
 final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]
@@ -71,8 +70,4 @@ We have evaluated the performance
 ## Limitations
 
 
-## Feedback
-
-### Where to send questions or comments about the model
 
-Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
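
For context, below is a minimal, self-contained sketch of the text-encoding path that the README snippet in the diff walks through. The diff does not show the repo id of the fine-tuned CLIP4Clip weights, so the base [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) checkpoint is used as a stand-in; the example query string and the use of `CLIPTextModelWithProjection` (which applies the text projection internally) are likewise assumptions for illustration, not the model card's exact recipe.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# NOTE: placeholder checkpoint; the fine-tuned CLIP4Clip text-encoder weights
# from this repository would be loaded the same way once the repo id is known.
checkpoint = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
model = CLIPTextModelWithProjection.from_pretrained(checkpoint)

search_sentence = "a basketball player performing a slam dunk"  # example query
inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])

# CLIPTextModelWithProjection applies the text projection to the end-of-text
# token state, which is conceptually the quantity the README snippet assembles
# by hand with `text_projection` and the argmax indexing.
text_embeds = outputs.text_embeds
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)  # L2-normalize
print(text_embeds.shape)  # torch.Size([1, 512])
```

The normalized embedding can then be compared (e.g. by cosine similarity) against video embeddings produced by the visual side of the model.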