ZhangYuanhan committed on
Commit
d593f18
1 Parent(s): eb11cb4

Update README.md

Files changed (1)
  1. README.md +227 -3
README.md CHANGED
@@ -1,3 +1,227 @@
- ---
- license: apache-2.0
- ---
+ ---
+ datasets:
+ - lmms-lab/LLaVA-NeXT-Video-SFT-Data
+ language:
+ - en
+ library_name: transformers
+ license: apache-2.0
+ metrics:
+ - accuracy
+ tags:
+ - multimodal
+ model-index:
+ - name: LLaVA-NeXT-Video-7B-Qwen2
+   results:
+   - task:
+       type: multimodal
+     dataset:
+       name: ActNet-QA
+       type: actnet-qa
+     metrics:
+     - type: accuracy
+       value: 56.5
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: EgoSchema
+       type: egoschema
+     metrics:
+     - type: accuracy
+       value: 57.3
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MLVU
+       type: mlvu
+     metrics:
+     - type: accuracy
+       value: 70.8
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MVBench
+       type: mvbench
+     metrics:
+     - type: accuracy
+       value: 58.6
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: NextQA
+       type: nextqa
+     metrics:
+     - type: accuracy
+       value: 83.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: PercepTest
+       type: percepTest
+     metrics:
+     - type: accuracy
+       value: 67.9
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoChatGPT
+       type: videochatgpt
+     metrics:
+     - type: score
+       value: 3.52
+       name: score
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoDC
+       type: videodc
+     metrics:
+     - type: score
+       value: 3.66
+       name: score
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: LongVideoBench
+       type: longvideobench
+     metrics:
+     - type: accuracy
+       value: 58.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoMME
+       type: videomme
+     metrics:
+     - type: accuracy
+       value: 63.3
+       name: accuracy
+       verified: true
+ ---
+
+ # LLaVA-NeXT-Video
+
+ ## Table of Contents
+
+ 1. [Model Summary](#model-summary)
+ 2. [Use](#use)
+ 3. [Limitations](#limitations)
+ 4. [Training](#training)
+ 5. [License](#license)
+ 6. [Citation](#citation)
+
+ ## Model Summary
+
+ The LLaVA-NeXT-Video models are 7B/72B-parameter models trained on [LLaVA-NeXT-Video-SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data), based on the Qwen2 language model with a context window of 32K tokens.
+
+ - **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
+ - **Point of Contact:** [Yuanhan Zhang](mailto:[email protected])
+ - **Languages:** English, Chinese
+
+
+ ## Use
+
+ ### Intended use
+
+ The model was trained on [LLaVA-NeXT-Video-SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data) and can interact with single images, multiple images, and videos, but it is specialized for videos.
+
+ **Feel free to share your generations in the Community tab!**
+
+ ### Generation
+
+ We provide a simple generation example for using the model. For more details, refer to the [GitHub repository](https://github.com/LLaVA-VL/LLaVA-NeXT).
+
+ ```python
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
+ import copy
+ import warnings
+
+ import numpy as np
+ from decord import VideoReader, cpu
+ from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
+ from llava.conversation import conv_templates
+ from llava.mm_utils import tokenizer_image_token
+ from llava.model.builder import load_pretrained_model
+
+ warnings.filterwarnings("ignore")
+
+
+ def load_video(video_path, max_frames_num, fps=1, force_sample=False):
+     """Sample frames from a video; return (frames, frame timestamps, duration in seconds)."""
+     if max_frames_num == 0:
+         # No frames requested: return one blank placeholder frame and empty timing info.
+         return np.zeros((1, 336, 336, 3)), "", 0.0
+     vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+     total_frame_num = len(vr)
+     avg_fps = vr.get_avg_fps()
+     video_time = total_frame_num / avg_fps
+     # Keep one frame every `step` frames, i.e. roughly `fps` frames per second.
+     step = max(1, round(avg_fps / fps))
+     frame_idx = list(range(0, total_frame_num, step))
+     if len(frame_idx) > max_frames_num or force_sample:
+         # Too many frames (or sampling forced): take max_frames_num evenly spaced frames.
+         frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
+     # Timestamp of each sampled frame, formatted like "0.00s,1.00s,...".
+     frame_time = ",".join(f"{i / avg_fps:.2f}s" for i in frame_idx)
+     spare_frames = vr.get_batch(frame_idx).asnumpy()
+     return spare_frames, frame_time, video_time
+
+
+ pretrained = "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2"
+ model_name = "llava_qwen"
+ device = "cuda"
+ device_map = "auto"
+ # Any extra llava_model_args can be passed as additional keyword arguments here.
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
+ model.eval()
+
+ video_path = "XXXX"  # replace with the path to your video
+ max_frames_num = 64  # must be an integer, not a string
+ video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
+ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
+ # The model expects a list with one frame tensor per sample and a matching list of modalities.
+ video = [video]
+
+ conv_template = "qwen_1_5"  # Make sure you use the correct chat template for each model
+ question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
+ conv = copy.deepcopy(conv_templates[conv_template])
+ conv.append_message(conv.roles[0], question)
+ conv.append_message(conv.roles[1], None)
+ prompt_question = conv.get_prompt()
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
+
+ cont = model.generate(
+     input_ids,
+     images=video,
+     modalities=["video"],
+     do_sample=False,
+     temperature=0,
+     max_new_tokens=4096,
+ )
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
+ print(text_outputs)
+ ```
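
The frame-selection arithmetic inside `load_video` can be tried without a GPU, the model weights, or a video file. The sketch below is only an illustration: `sample_frame_indices` is a hypothetical helper name (it is not part of the LLaVA-NeXT package), and the frame count and frame rate are made-up inputs standing in for what `decord` would report. It assumes timestamps are computed as frame index divided by the average frame rate.

```python
import numpy as np


def sample_frame_indices(total_frames, avg_fps, max_frames, target_fps=1, force_sample=False):
    """Hypothetical helper: return (frame indices, timestamps in seconds) for a clip."""
    # Stride that yields roughly `target_fps` sampled frames per second.
    step = max(1, round(avg_fps / target_fps))
    frame_idx = list(range(0, total_frames, step))
    if len(frame_idx) > max_frames or force_sample:
        # Fall back to `max_frames` indices spread evenly over the whole clip.
        frame_idx = np.linspace(0, total_frames - 1, max_frames, dtype=int).tolist()
    frame_time = [i / avg_fps for i in frame_idx]
    return frame_idx, frame_time


# A 10-second clip at 30 fps, capped at 4 frames: 1 fps sampling would give
# 10 frames, so the helper switches to uniform sampling.
idx, times = sample_frame_indices(total_frames=300, avg_fps=30.0, max_frames=4)
print(idx)  # [0, 99, 199, 299]
print(",".join(f"{t:.2f}s" for t in times))
```

When the clip is short enough that 1 fps sampling stays under the cap, the strided indices are returned unchanged; passing `force_sample=True`, as the generation example does, always produces exactly `max_frames` evenly spaced frames.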
+
+
+ ## Training
+
+ ### Model
+
+ - **Architecture:** SO400M + Qwen2
+ - **Initialized Model:** lmms-lab/llava-onevision-qwen2-7b-si
+ - **Data:** A mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full-model training
+ - **Precision:** bfloat16
+
+ ### Hardware & Software
+
+ - **GPUs:** 256 × NVIDIA A100 (for training the whole model series)
+ - **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
+ - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
+
+ ## Citation