Fix a couple of typos and add some metadata tags

#7
opened by pcuenq (HF staff)
Files changed (1)
README.md +13 -5

README.md CHANGED
@@ -1,6 +1,14 @@
 ---
+language:
+- en
 library_name: transformers
-tags: []
+license: apache-2.0
+pipeline_tag: video-text-to-text
+datasets:
+- liuhaotian/LLaVA-Pretrain
+- liuhaotian/LLaVA-Instruct-150K
+- luoruipu1/Valley-Instruct-65k
+- lmms-lab/VideoChatGPT
 ---
 
 # Model Card for Video-LLaVa
@@ -10,11 +18,11 @@ tags: []
 
 
 **Model type:**
-Video-LLaVA is an open-source multomodal model trained by fine-tuning LLM on multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.
+Video-LLaVA is an open-source multimodal model trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.
 Base LLM: [lmsys/vicuna-13b-v1.5](https://huggingface.co/lmsys/vicuna-13b-v1.5)
 
 **Model Description:**
-The model can generate interleaving images and videos, despite the absence of image-video pairs in the dataset. Video-LLaVa is uses an encoder trained for unified visual representation through alignment prior to projection.
+The model can generate text from interleaving images and videos, despite the absence of image-video pairs in the dataset. Video-LLaVa uses an encoder trained for unified visual representation through alignment prior to projection.
 Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.
 
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videollava_example.png"
@@ -103,8 +111,8 @@ print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_to
 
 
 ## 👍 Acknowledgement
-* [LLaVA](https://github.com/haotian-liu/LLaVA) The codebase we built upon and it is an efficient large language and vision assistant.
-* [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) Great job contributing the evaluation code and dataset.
+* [LLaVA](https://github.com/haotian-liu/LLaVA) The codebase we built upon, LlaVA is an efficient large language model and vision assistant.
+* [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) We are grateful for the contribution of the evaluation code and dataset.
 
 ## 🔒 License
 * The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/LICENSE) file.
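
For reviewers who want to try the usage snippet that the last hunk's context line (`print(processor.batch_decode(...))`) comes from, here is a minimal sketch using the transformers Video-LLaVA classes. The checkpoint id `LanguageBind/Video-LLaVA-7B-hf`, the prompt wording, the local file name, and the frame-sampling helper are assumptions for illustration, not taken from this card.

```python
# Minimal sketch, not the card's exact snippet. Assumed: the checkpoint id,
# the prompt wording, and the local video path below.
import av
import numpy as np
import torch
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"  # assumed checkpoint; substitute the repo this card belongs to
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # device_map="auto" requires `accelerate`
)

def sample_frames(path, num_frames=8):
    """Decode the clip with PyAV and keep `num_frames` evenly spaced RGB frames."""
    container = av.open(path)
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return np.stack([frames[i] for i in indices])

clip = sample_frames("sample_video.mp4")  # hypothetical local file
prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
# Cast floating-point inputs to the model dtype and move everything to its device.
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```

The final `print(...)` call is the same line that appears as context in the `@@ -103,8 +111,8 @@` hunk header above.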