Video-Text-to-Text
Safetensors
custom_code
ynhe committed
Commit 6575e69 • 1 Parent(s): 0eb2182

Update README.md

Files changed (1)
  1. README.md +237 -247
README.md CHANGED
@@ -1,247 +1,237 @@
- ---
- license: mit
- pipeline_tag: video-text-to-text
- extra_gated_prompt: >-
-   You agree to not use the model to conduct experiments that cause harm to human
-   subjects.
- extra_gated_fields:
-   Name: text
-   Company/Organization: text
-   Country: text
-   E-Mail: text
- ---
-
- # InternVideo2-Chat-8B-HD
-
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2) [\[📜 Tech Report\]](https://arxiv.org/abs/2403.15377)
- <!-- [\[🗨️ Chat Demo\]](https://vchat.opengvlab.com/) -->
-
- To further enrich the semantics embedded in **InternVideo2** and improve its user-friendliness in human communication, we tune InternVideo2 by incorporating it into a VideoLLM with an LLM and a video BLIP. We employ the progressive learning scheme of [VideoChat](https://arxiv.org/abs/2311.17005), using InternVideo2 as the video encoder and training a video BLIP to communicate with an open-source LLM. During training, the video encoder is also updated. Detailed training recipes are given in [VideoChat](https://arxiv.org/abs/2311.17005). This model is trained with HD inputs.
-
- The base LLM of this model is Mistral-7B. **Before using it, please ensure that you have obtained access to Mistral-7B.** If you have not, please go to [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to request access, then add your `HF_TOKEN` to the environment variables.
-
- ## 🚀 How to use the model
-
- 1. Apply for access to this model and to the base LLM.
-
- 2. Set your HF user access token as an environment variable:
-
- ```shell
- export HF_TOKEN=hf_....
- ```
- If you don't know how to obtain a token starting with "hf_", please refer to [How to Get an HF User Access Token](https://huggingface.co/docs/hub/security-tokens#user-access-tokens).
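A minimal sketch of a programmatic alternative to the shell export above, assuming `huggingface_hub` (installed alongside `transformers`); the token placeholder is the same "hf_..." value:

```python
# Log in to the Hugging Face Hub from Python instead of exporting HF_TOKEN.
from huggingface_hub import login

login(token="hf_....")  # paste your own "hf_..." user access token here
```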
-
- 3. Make sure you have `transformers >= 4.38.0` and `peft==0.5.0`.
-
- Install the requisite Python packages from [pip_requirements](https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD/blob/main/requirements.txt).
-
- 4. Run inference with video input:
-
- ```python
- import os
- import torch
- from transformers import AutoTokenizer, AutoModel
-
- tokenizer = AutoTokenizer.from_pretrained(
-     'OpenGVLab/InternVideo2_Chat_8B_InternLM2_5',
-     trust_remote_code=True,
-     use_fast=False)
- if torch.cuda.is_available():
-     model = AutoModel.from_pretrained(
-         'OpenGVLab/InternVideo2_Chat_8B_InternLM2_5',
-         torch_dtype=torch.bfloat16,
-         trust_remote_code=True).cuda()
- else:
-     model = AutoModel.from_pretrained(
-         'OpenGVLab/InternVideo2_Chat_8B_InternLM2_5',
-         torch_dtype=torch.bfloat16,
-         trust_remote_code=True)
-
- import numpy as np
- import decord
- from decord import VideoReader, cpu
- import torch.nn.functional as F
- from torchvision import transforms
-
- # Make decord return torch tensors directly.
- decord.bridge.set_bridge("torch")
-
- def get_index(num_frames, num_segments):
-     # Uniformly sample num_segments frame indices, centered within each segment.
-     seg_size = float(num_frames - 1) / num_segments
-     start = int(seg_size / 2)
-     offsets = np.array([
-         start + int(np.round(seg_size * idx)) for idx in range(num_segments)
-     ])
-     return offsets
-
-
- def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
-     vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
-     num_frames = len(vr)
-     frame_indices = get_index(num_frames, num_segments)
-
-     # ImageNet normalization statistics.
-     mean = (0.485, 0.456, 0.406)
-     std = (0.229, 0.224, 0.225)
-
-     transform = transforms.Compose([
-         transforms.Lambda(lambda x: x.float().div(255.0)),
-         transforms.Normalize(mean, std)
-     ])
-
-     frames = vr.get_batch(frame_indices)
-     frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
-
-     if padding:
-         frames = HD_transform_padding(frames.float(), image_size=resolution, hd_num=hd_num)
-     else:
-         frames = HD_transform_no_padding(frames.float(), image_size=resolution, hd_num=hd_num)
-
-     frames = transform(frames)
-     # print(frames.shape)
-     T_, C, H, W = frames.shape
-
-     # Split each frame into resolution x resolution local tiles ...
-     sub_img = frames.reshape(
-         1, T_, 3, H//resolution, resolution, W//resolution, resolution
-     ).permute(0, 3, 5, 1, 2, 4, 6).reshape(-1, T_, 3, resolution, resolution).contiguous()
-
-     # ... and append a downscaled global view of the whole frame.
-     glb_img = F.interpolate(
-         frames.float(), size=(resolution, resolution), mode='bicubic', align_corners=False
-     ).to(sub_img.dtype).unsqueeze(0)
-
-     frames = torch.cat([sub_img, glb_img]).unsqueeze(0)
-
-     if return_msg:
-         fps = float(vr.get_avg_fps())
-         sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
-         # A leading and trailing space may be needed when this message is inserted into a prompt.
-         msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
-         return frames, msg
-     else:
-         return frames
-
- def HD_transform_padding(frames, image_size=224, hd_num=6):
-     def _padding_224(frames):
-         _, _, H, W = frames.shape
-         tar = int(np.ceil(H / 224) * 224)
-         top_padding = (tar - H) // 2
-         bottom_padding = tar - H - top_padding
-         left_padding = 0
-         right_padding = 0
-
-         padded_frames = F.pad(
-             frames,
-             pad=[left_padding, right_padding, top_padding, bottom_padding],
-             mode='constant', value=255
-         )
-         return padded_frames
-
-     _, _, H, W = frames.shape
-     trans = False
-     if W < H:
-         frames = frames.flip(-2, -1)
-         trans = True
-         width, height = H, W
-     else:
-         width, height = W, H
-
-     ratio = width / height
-     scale = 1
-     while scale * np.ceil(scale / ratio) <= hd_num:
-         scale += 1
-     scale -= 1
-     new_w = int(scale * image_size)
-     new_h = int(new_w / ratio)
-
-     resized_frames = F.interpolate(
-         frames, size=(new_h, new_w),
-         mode='bicubic',
-         align_corners=False
-     )
-     padded_frames = _padding_224(resized_frames)
-
-     if trans:
-         padded_frames = padded_frames.flip(-2, -1)
-
-     return padded_frames
-
- def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
-     best_ratio_diff = float('inf')
-     best_ratio = (1, 1)
-     area = width * height
-     for ratio in target_ratios:
-         target_aspect_ratio = ratio[0] / ratio[1]
-         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
-         if ratio_diff < best_ratio_diff:
-             best_ratio_diff = ratio_diff
-             best_ratio = ratio
-         elif ratio_diff == best_ratio_diff:
-             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
-                 best_ratio = ratio
-     return best_ratio
-
-
- def HD_transform_no_padding(frames, image_size=224, hd_num=6, fix_ratio=(2, 1)):
-     min_num = 1
-     max_num = hd_num
-     _, _, orig_height, orig_width = frames.shape
-     aspect_ratio = orig_width / orig_height
-
-     # Enumerate the candidate tiling ratios allowed by hd_num.
-     target_ratios = set(
-         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1)
-         if i * j <= max_num and i * j >= min_num)
-     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
-
-     # Find the closest aspect ratio to the target (or use the fixed one).
-     if fix_ratio:
-         target_aspect_ratio = fix_ratio
-     else:
-         target_aspect_ratio = find_closest_aspect_ratio(
-             aspect_ratio, target_ratios, orig_width, orig_height, image_size)
-
-     # Calculate the target width and height.
-     target_width = image_size * target_aspect_ratio[0]
-     target_height = image_size * target_aspect_ratio[1]
-     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
-
-     # Resize the frames.
-     resized_frame = F.interpolate(
-         frames, size=(target_height, target_width),
-         mode='bicubic', align_corners=False
-     )
-     return resized_frame
-
- video_path = "yoga.mp4"
- # Sample 8 frames uniformly from the video.
- video_tensor = load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=6)
- video_tensor = video_tensor.to(model.device)
-
- chat_history = []
- response, chat_history = model.chat(
-     tokenizer, '', 'Describe the video step by step',
-     instruction="Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n",
-     media_type='video', media_tensor=video_tensor,
-     chat_history=chat_history, return_history=True,
-     generation_config={'do_sample': False, 'max_new_tokens': 512})
- print(response)
- ```
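As a quick check that the helpers above behave as expected, here is a hypothetical sanity-check snippet (not part of the original model card); it assumes the definitions from the code block above have already been run and avoids needing a real video file:

```python
# Sanity-check the frame sampler and the no-padding HD transform defined above.
import torch

# get_index should return num_segments increasing indices within [0, num_frames).
idx = get_index(num_frames=200, num_segments=8)
print(idx)
assert len(idx) == 8 and all(0 <= i < 200 for i in idx)

# A synthetic clip: 8 frames, 3 channels, 720x1280 (16:9 landscape).
fake = torch.randint(0, 256, (8, 3, 720, 1280)).float()

# With the default fix_ratio=(2, 1), frames are resized to 224x448 before tiling.
out = HD_transform_no_padding(fake, image_size=224, hd_num=6)
print(out.shape)  # expected: torch.Size([8, 3, 224, 448])
```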
-
- ## ✏️ Citation
- If this work is helpful for your research, please consider citing InternVideo and VideoChat.
-
- ```
- @article{wang2024internvideo2,
-   title={Internvideo2: Scaling video foundation models for multimodal video understanding},
-   author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
-   journal={arXiv preprint arXiv:2403.15377},
-   year={2024}
- }
-
- @article{li2023videochat,
-   title={Videochat: Chat-centric video understanding},
-   author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
-   journal={arXiv preprint arXiv:2305.06355},
-   year={2023}
- }
- ```
 
+ ---
+ license: mit
+ pipeline_tag: video-text-to-text
+ extra_gated_prompt: >-
+   You agree to not use the model to conduct experiments that cause harm to human
+   subjects.
+ extra_gated_fields:
+   Name: text
+   Company/Organization: text
+   Country: text
+   E-Mail: text
+ ---
+
+ # InternVideo2-Chat-8B-InternLM2.5
+
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2) [\[📜 Tech Report\]](https://arxiv.org/abs/2403.15377)
+
+ To further enrich the semantics embedded in **InternVideo2** and improve its user-friendliness in human communication, we tune InternVideo2 by incorporating it into a VideoLLM with an LLM and a video BLIP. We employ the progressive learning scheme of [VideoChat](https://arxiv.org/abs/2311.17005), using InternVideo2 as the video encoder and training a video BLIP to communicate with an open-source LLM. During training, the video encoder is also updated. Detailed training recipes are given in [VideoChat](https://arxiv.org/abs/2311.17005). This model is trained with HD inputs.
+
+ The base LLM of this model is [InternLM2.5-7B](https://huggingface.co/internlm/internlm2_5-7b-chat-1m), which has a 1M-token context window.
+
+ ## 🚀 How to use the model
+
+ 1. Make sure you have `transformers >= 4.38.0` and `peft==0.5.0`.
+
+ Install the requisite Python packages from [pip_requirements](https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD/blob/main/requirements.txt).
+
+ 2. Run inference with video input:
+
+ ```python
+ import os
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     'OpenGVLab/InternVideo2_Chat_8B_InternLM2_5',
+     trust_remote_code=True,
+     use_fast=False)
+ if torch.cuda.is_available():
+     model = AutoModel.from_pretrained(
+         'OpenGVLab/InternVideo2_Chat_8B_InternLM2_5',
+         torch_dtype=torch.bfloat16,
+         trust_remote_code=True).cuda()
+ else:
+     model = AutoModel.from_pretrained(
+         'OpenGVLab/InternVideo2_Chat_8B_InternLM2_5',
+         torch_dtype=torch.bfloat16,
+         trust_remote_code=True)
+
+ import numpy as np
+ import decord
+ from decord import VideoReader, cpu
+ import torch.nn.functional as F
+ from torchvision import transforms
+
+ # Make decord return torch tensors directly.
+ decord.bridge.set_bridge("torch")
+
+ def get_index(num_frames, num_segments):
+     # Uniformly sample num_segments frame indices, centered within each segment.
+     seg_size = float(num_frames - 1) / num_segments
+     start = int(seg_size / 2)
+     offsets = np.array([
+         start + int(np.round(seg_size * idx)) for idx in range(num_segments)
+     ])
+     return offsets
+
+
+ def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
+     vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+     num_frames = len(vr)
+     frame_indices = get_index(num_frames, num_segments)
+
+     # ImageNet normalization statistics.
+     mean = (0.485, 0.456, 0.406)
+     std = (0.229, 0.224, 0.225)
+
+     transform = transforms.Compose([
+         transforms.Lambda(lambda x: x.float().div(255.0)),
+         transforms.Normalize(mean, std)
+     ])
+
+     frames = vr.get_batch(frame_indices)
+     frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
+
+     if padding:
+         frames = HD_transform_padding(frames.float(), image_size=resolution, hd_num=hd_num)
+     else:
+         frames = HD_transform_no_padding(frames.float(), image_size=resolution, hd_num=hd_num)
+
+     frames = transform(frames)
+     # print(frames.shape)
+     T_, C, H, W = frames.shape
+
+     # Split each frame into resolution x resolution local tiles ...
+     sub_img = frames.reshape(
+         1, T_, 3, H//resolution, resolution, W//resolution, resolution
+     ).permute(0, 3, 5, 1, 2, 4, 6).reshape(-1, T_, 3, resolution, resolution).contiguous()
+
+     # ... and append a downscaled global view of the whole frame.
+     glb_img = F.interpolate(
+         frames.float(), size=(resolution, resolution), mode='bicubic', align_corners=False
+     ).to(sub_img.dtype).unsqueeze(0)
+
+     frames = torch.cat([sub_img, glb_img]).unsqueeze(0)
+
+     if return_msg:
+         fps = float(vr.get_avg_fps())
+         sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
+         # A leading and trailing space may be needed when this message is inserted into a prompt.
+         msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
+         return frames, msg
+     else:
+         return frames
+
+ def HD_transform_padding(frames, image_size=224, hd_num=6):
+     def _padding_224(frames):
+         _, _, H, W = frames.shape
+         tar = int(np.ceil(H / 224) * 224)
+         top_padding = (tar - H) // 2
+         bottom_padding = tar - H - top_padding
+         left_padding = 0
+         right_padding = 0
+
+         padded_frames = F.pad(
+             frames,
+             pad=[left_padding, right_padding, top_padding, bottom_padding],
+             mode='constant', value=255
+         )
+         return padded_frames
+
+     _, _, H, W = frames.shape
+     trans = False
+     if W < H:
+         frames = frames.flip(-2, -1)
+         trans = True
+         width, height = H, W
+     else:
+         width, height = W, H
+
+     ratio = width / height
+     scale = 1
+     while scale * np.ceil(scale / ratio) <= hd_num:
+         scale += 1
+     scale -= 1
+     new_w = int(scale * image_size)
+     new_h = int(new_w / ratio)
+
+     resized_frames = F.interpolate(
+         frames, size=(new_h, new_w),
+         mode='bicubic',
+         align_corners=False
+     )
+     padded_frames = _padding_224(resized_frames)
+
+     if trans:
+         padded_frames = padded_frames.flip(-2, -1)
+
+     return padded_frames
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+
+ def HD_transform_no_padding(frames, image_size=224, hd_num=6, fix_ratio=(2, 1)):
+     min_num = 1
+     max_num = hd_num
+     _, _, orig_height, orig_width = frames.shape
+     aspect_ratio = orig_width / orig_height
+
+     # Enumerate the candidate tiling ratios allowed by hd_num.
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1)
+         if i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # Find the closest aspect ratio to the target (or use the fixed one).
+     if fix_ratio:
+         target_aspect_ratio = fix_ratio
+     else:
+         target_aspect_ratio = find_closest_aspect_ratio(
+             aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # Calculate the target width and height.
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # Resize the frames.
+     resized_frame = F.interpolate(
+         frames, size=(target_height, target_width),
+         mode='bicubic', align_corners=False
+     )
+     return resized_frame
+
+ video_path = "yoga.mp4"
+ # Sample 8 frames uniformly from the video.
+ video_tensor = load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=6)
+ video_tensor = video_tensor.to(model.device)
+
+ chat_history = []
+ response, chat_history = model.chat(
+     tokenizer, '', 'Describe the video step by step',
+     instruction="Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n",
+     media_type='video', media_tensor=video_tensor,
+     chat_history=chat_history, return_history=True,
+     generation_config={'do_sample': False, 'max_new_tokens': 512})
+ print(response)
+ ```
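Because `model.chat` returns the updated `chat_history`, a follow-up turn can reuse it. Below is a minimal sketch; the question and generation settings are illustrative only, and if the remote modeling code requires the `instruction` argument on every turn, pass it again as in the first call:

```python
# Ask a follow-up question in the same conversation, reusing the history above.
response, chat_history = model.chat(
    tokenizer, '', 'What is the person in the video doing at the end?',
    media_type='video', media_tensor=video_tensor,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': False, 'max_new_tokens': 256})
print(response)
```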
+
+ ## ✏️ Citation
+ If this work is helpful for your research, please consider citing InternVideo and VideoChat.
+
+ ```
+ @article{wang2024internvideo2,
+   title={Internvideo2: Scaling video foundation models for multimodal video understanding},
+   author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
+   journal={arXiv preprint arXiv:2403.15377},
+   year={2024}
+ }
+
+ @article{li2023videochat,
+   title={Videochat: Chat-centric video understanding},
+   author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
+   journal={arXiv preprint arXiv:2305.06355},
+   year={2023}
+ }
+ ```