nielsr HF staff committed on
Commit
219d536
1 Parent(s): 58ac4df

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - vision
+ - video-classification
+ model-index:
+ - name: nielsr/xclip-base-patch16-hmdb-8-shot
+   results:
+   - task:
+       type: video-classification
+     dataset:
+       name: HMDB-51
+       type: hmdb-51
+     metrics:
+     - type: top-1 accuracy
+       value: 62.8
+ ---
+
+ # X-CLIP (base-sized model)
+
+ X-CLIP model (base-sized, patch resolution of 16) trained in a few-shot fashion (K=8) on [HMDB-51](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). It was introduced in the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP).
+
+ This model was trained using 32 frames per video, at a resolution of 224x224.
+
+ Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ X-CLIP is a minimal extension of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.
+
+ ![X-CLIP architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png)
+
+ This contrastive training allows the model to be used for tasks such as zero-shot, few-shot, or fully supervised video classification and video-text retrieval.
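
As a rough illustration of how a contrastively trained model can classify a video without a dedicated classification head, the sketch below scores one video embedding against several text embeddings using cosine similarity followed by a softmax. The function names, the toy 3-dimensional embeddings, and the `logit_scale` value are all assumptions made for this example; they are not taken from the X-CLIP code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_video_to_texts(video_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per text prompt for a single video.

    logit_scale is a hypothetical temperature chosen for this sketch.
    """
    logits = [
        logit_scale * cosine_similarity(video_embedding, text)
        for text in text_embeddings
    ]
    return softmax(logits)


# Toy embeddings, made up for illustration only.
video = [0.9, 0.1, 0.0]
prompts = {"climbing": [1.0, 0.0, 0.0], "swimming": [0.0, 1.0, 0.0]}
probs = match_video_to_texts(video, list(prompts.values()))
best_label = list(prompts)[probs.index(max(probs))]
print(best_label)
```

In a real setting the embeddings would come from the video and text encoders, and the label set would be the class names of the target dataset.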
+
+ ## Intended uses & limitations
+
+ You can use the raw model to determine how well a text matches a given video. See the [model hub](https://huggingface.co/models?search=microsoft/xclip) to look for
+ fine-tuned versions on a task that interests you.
+
+ ### How to use
+
+ For code examples, see the [documentation](https://huggingface.co/docs/transformers/main/model_doc/xclip).
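
A minimal sketch of inference with the `transformers` X-CLIP classes might look as follows. It assumes that the checkpoint id from this card's metadata (`nielsr/xclip-base-patch16-hmdb-8-shot`) can be loaded from the hub, and that the caller has already decoded the video into RGB frames. The helper `sample_frame_indices` and its uniform spacing are hypothetical and may differ from the sampling used in training.

```python
def sample_frame_indices(total_frames, num_frames=32):
    """Pick num_frames evenly spaced frame indices from a clip.

    Uniform spacing is an assumption made for this sketch; the original
    training pipeline may sample frames differently.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    step = total_frames / num_frames
    # Take the middle frame of each of the num_frames equal segments.
    return [min(total_frames - 1, int(step * (i + 0.5))) for i in range(num_frames)]


def classify_video(frames, labels, checkpoint="nielsr/xclip-base-patch16-hmdb-8-shot"):
    """Score a list of RGB frames (H x W x 3 arrays) against text labels.

    torch and transformers are imported lazily so the helper above can be
    used without them. The checkpoint id is assumed to be available.
    """
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(checkpoint)
    model = XCLIPModel.from_pretrained(checkpoint)

    inputs = processor(text=labels, videos=list(frames), return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # One row per video; softmax turns similarities into label probabilities.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return dict(zip(labels, probs.tolist()))
```

Calling `classify_video(frames, ["climb", "swim", "run"])` with 32 frames selected via `sample_frame_indices` would return a mapping from each label to its probability; the label strings here are placeholders.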
+
+ ## Training data
+
+ This model was trained on [HMDB-51](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/).
+
+ ### Preprocessing
+
+ The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247).
+
+ The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285).
+
+ During validation, the shorter edge of each frame is resized, after which the frame is center-cropped to a fixed resolution (such as 224x224). The frames are then normalized per RGB channel with the ImageNet mean and standard deviation.
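
The validation steps described above can be sketched in plain Python as geometry plus per-channel arithmetic. The function names, the shorter-edge target of 224, and the assumption that pixels are scaled to the 0 to 1 range before normalization are choices made for this example; the linked `build.py` remains the authoritative reference.

```python
# Widely used ImageNet statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_shorter_edge(width, height, target=224):
    """New (width, height) after scaling so the shorter edge equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to ImageNet-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# Example: a 320x240 frame is resized, then cropped around its center.
resized = resize_shorter_edge(320, 240)
box = center_crop_box(*resized)
print(resized, box)
```

Resizing before cropping keeps as much of the frame as possible while still producing a square input of the size the model expects.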
+
+ ## Evaluation results
+
+ This model achieves a top-1 accuracy of 62.8% on HMDB-51 in the few-shot (K=8) setting.
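
For readers unfamiliar with the metric, top-1 accuracy is the percentage of clips whose highest-scoring class matches the ground-truth class. The sketch below computes it on made-up scores; the function name and the toy data are illustrative only and are unrelated to the reported number.

```python
def top1_accuracy(score_rows, true_labels):
    """Percentage of samples whose highest-scoring class is the true class."""
    if len(score_rows) != len(true_labels) or not score_rows:
        raise ValueError("need one non-empty score row per label")
    correct = sum(
        1
        for scores, label in zip(score_rows, true_labels)
        if max(range(len(scores)), key=scores.__getitem__) == label
    )
    return 100.0 * correct / len(true_labels)


# Made-up scores for four clips over three classes (illustration only).
scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 1]
accuracy = top1_accuracy(scores, labels)
print(accuracy)
```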