zR committed on
Commit 4c3068a
Parent: 600a885
Files changed (2)
  1. README.md +135 -48
  2. README_zh.md +143 -0
README.md CHANGED
@@ -1,51 +1,138 @@
- ---
- frameworks:
- - Pytorch
- license: other
- tasks:
- - image-text-to-text
-
- #model-type:
- ## e.g. gpt, phi, llama, chatglm, baichuan
- #- gpt
-
- #domain:
- ## e.g. nlp, cv, audio, multi-modal
- #- nlp
-
- #language:
- ## list of language codes: https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
- #- cn
-
- #metrics:
- ## e.g. CIDEr, BLEU, ROUGE
- #- CIDEr
-
- #tags:
- ## custom tags, such as the training method: pretrained, fine-tuned, instruction-tuned, RL-tuned, etc.
- #- pretrained
-
- #tools:
- ## e.g. vllm, fastchat, llamacpp, AdaSeq
- #- vllm
- ---
- ### The contributors of this model have not provided a more detailed introduction. The model files and weights are available on the "Model Files" page.
- #### You can download the model with the git clone command below or via the ModelScope SDK
-
- SDK download
- ```bash
- # Install ModelScope
- pip install modelscope
- ```
  ```python
- # Download the model via the SDK
- from modelscope import snapshot_download
- model_dir = snapshot_download('ZhipuAI/cogvlm2-llama3-caption')
- ```
- Git download
- ```
- # Download the model via git
- git clone https://www.modelscope.cn/ZhipuAI/cogvlm2-llama3-caption.git
  ```
 
- <p style="color: lightgrey;">If you are a contributor to this model, we invite you to complete the model card promptly, following the <a href="https://modelscope.cn/docs/ModelScope%E6%A8%A1%E5%9E%8B%E6%8E%A5%E5%85%A5%E6%B5%81%E7%A8%8B%E6%A6%82%E8%A7%88" style="color: lightgrey; text-decoration: underline;">model contribution documentation</a>.</p>
+ # CogVLM2-Llama3-Caption
+
+ <div align="center">
+ <img src=https://raw.githubusercontent.com/THUDM/CogVLM2/cf9cb3c60a871e0c8e5bde7feaf642e3021153e6/resources/logo.svg>
+ </div>
+
+ Typically, most video data does not come with corresponding descriptive text, so it is necessary to convert the video data into textual descriptions to provide the essential training data for text-to-video models.
+
+ ## Usage
  ```python
+ import io
+ import numpy as np
+ import torch
+ from decord import cpu, VideoReader, bridge
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import argparse
+
+ MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
+
+ DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
+ TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
+     0] >= 8 else torch.float16
+
+ parser = argparse.ArgumentParser(description="CogVLM2-Video CLI Demo")
+ parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
+ args = parser.parse_args([])  # parse an empty list so the snippet also runs outside a CLI
+
+
+ def load_video(video_data, strategy='chat'):
+     bridge.set_bridge('torch')
+     mp4_stream = video_data
+     num_frames = 24
+     decord_vr = VideoReader(io.BytesIO(mp4_stream), ctx=cpu(0))
+
+     frame_id_list = None
+     total_frames = len(decord_vr)
+     # 'base': sample frames evenly from the first 60 seconds; 'chat': roughly one frame per second, up to num_frames
+     if strategy == 'base':
+         clip_end_sec = 60
+         clip_start_sec = 0
+         start_frame = int(clip_start_sec * decord_vr.get_avg_fps())
+         end_frame = min(total_frames,
+                         int(clip_end_sec * decord_vr.get_avg_fps())) if clip_end_sec is not None else total_frames
+         frame_id_list = np.linspace(start_frame, end_frame - 1, num_frames, dtype=int)
+     elif strategy == 'chat':
+         timestamps = decord_vr.get_frame_timestamp(np.arange(total_frames))
+         timestamps = [i[0] for i in timestamps]
+         max_second = round(max(timestamps)) + 1
+         frame_id_list = []
+         for second in range(max_second):
+             closest_num = min(timestamps, key=lambda x: abs(x - second))
+             index = timestamps.index(closest_num)
+             frame_id_list.append(index)
+             if len(frame_id_list) >= num_frames:
+                 break
+
+     video_data = decord_vr.get_batch(frame_id_list)
+     video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
+     return video_data
+
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     MODEL_PATH,
+     trust_remote_code=True,
+     # padding_side="left"
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=TORCH_TYPE,
+     trust_remote_code=True
+ ).eval().to(DEVICE)
+
+
+ def predict(prompt, video_data, temperature):
+     strategy = 'chat'
+
+     video = load_video(video_data, strategy=strategy)
+
+     history = []
+     query = prompt
+     inputs = model.build_conversation_input_ids(
+         tokenizer=tokenizer,
+         query=query,
+         images=[video],
+         history=history,
+         template_version=strategy
+     )
+     inputs = {
+         'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
+         'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
+         'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
+         'images': [[inputs['images'][0].to('cuda').to(TORCH_TYPE)]],
+     }
+     gen_kwargs = {
+         "max_new_tokens": 2048,
+         "pad_token_id": 128002,
+         "top_k": 1,
+         "do_sample": False,
+         "top_p": 0.1,
+         "temperature": temperature,
+     }
+     with torch.no_grad():
+         outputs = model.generate(**inputs, **gen_kwargs)
+         outputs = outputs[:, inputs['input_ids'].shape[1]:]  # keep only the newly generated tokens
+         response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+     return response
+
+
+ def test():
+     prompt = "Please describe this video in detail."
+     temperature = 0.1
+     video_data = open('test.mp4', 'rb').read()
+     response = predict(prompt, video_data, temperature)
+     print(response)
+
+
+ if __name__ == '__main__':
+     test()
+
  ```
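+ The `--quant` flag above is parsed but never applied in this minimal demo. If GPU memory is tight, one possible way to honor it is to load the weights through the `transformers` + `bitsandbytes` quantization integration. The sketch below is illustrative and untested against this model's custom (`trust_remote_code`) code paths, not part of the official demo:
+
+ ```python
+ # Hypothetical 4-bit / 8-bit loading path for args.quant; assumes bitsandbytes is installed.
+ from transformers import BitsAndBytesConfig
+
+ if args.quant == 4:
+     quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=TORCH_TYPE)
+ elif args.quant == 8:
+     quant_config = BitsAndBytesConfig(load_in_8bit=True)
+ else:
+     quant_config = None
+
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=TORCH_TYPE,
+     trust_remote_code=True,
+     quantization_config=quant_config,                      # ignored when None
+     device_map="auto" if quant_config is not None else None,
+ ).eval()
+ if quant_config is None:
+     model = model.to(DEVICE)  # quantized models must not be moved with .to()
+ ```
+
+ When in doubt, stick to the BF16/FP16 path shown in the demo above.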
 
+ ## License
+
+ This model is released under the CogVLM2 [LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0). For models built with Meta Llama 3, please also comply with the [LLAMA3_LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
+
+ ## Citation
+
+ 🌟 If you find our work helpful, please leave us a star and cite our paper.
+
+ ```
+ @article{yang2024cogvideox,
+   title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+   author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+   journal={arXiv preprint arXiv:2408.06072},
+   year={2024}
+ }
+ ```
README_zh.md ADDED
@@ -0,0 +1,143 @@
+ # CogVLM2-Llama3-Caption
+
+ <div align="center">
+ <img src=https://raw.githubusercontent.com/THUDM/CogVLM2/cf9cb3c60a871e0c8e5bde7feaf642e3021153e6/resources/logo.svg>
+ </div>
+
+ ## Introduction
+
+ Typically, most video data does not come with corresponding descriptive text, so it is necessary to convert the video
+ data into textual descriptions to provide the essential training data for text-to-video models.
+
+ ## Usage
+
+ ```python
+ import io
+ import numpy as np
+ import torch
+ from decord import cpu, VideoReader, bridge
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import argparse
+
+ MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
+
+ DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
+ TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[
+     0] >= 8 else torch.float16
+
+ parser = argparse.ArgumentParser(description="CogVLM2-Video CLI Demo")
+ parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
+ args = parser.parse_args([])  # parse an empty list so the snippet also runs outside a CLI
+
+
+ def load_video(video_data, strategy='chat'):
+     bridge.set_bridge('torch')
+     mp4_stream = video_data
+     num_frames = 24
+     decord_vr = VideoReader(io.BytesIO(mp4_stream), ctx=cpu(0))
+
+     frame_id_list = None
+     total_frames = len(decord_vr)
+     # 'base': sample frames evenly from the first 60 seconds; 'chat': roughly one frame per second, up to num_frames
+     if strategy == 'base':
+         clip_end_sec = 60
+         clip_start_sec = 0
+         start_frame = int(clip_start_sec * decord_vr.get_avg_fps())
+         end_frame = min(total_frames,
+                         int(clip_end_sec * decord_vr.get_avg_fps())) if clip_end_sec is not None else total_frames
+         frame_id_list = np.linspace(start_frame, end_frame - 1, num_frames, dtype=int)
+     elif strategy == 'chat':
+         timestamps = decord_vr.get_frame_timestamp(np.arange(total_frames))
+         timestamps = [i[0] for i in timestamps]
+         max_second = round(max(timestamps)) + 1
+         frame_id_list = []
+         for second in range(max_second):
+             closest_num = min(timestamps, key=lambda x: abs(x - second))
+             index = timestamps.index(closest_num)
+             frame_id_list.append(index)
+             if len(frame_id_list) >= num_frames:
+                 break
+
+     video_data = decord_vr.get_batch(frame_id_list)
+     video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
+     return video_data
+
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     MODEL_PATH,
+     trust_remote_code=True,
+     # padding_side="left"
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=TORCH_TYPE,
+     trust_remote_code=True
+ ).eval().to(DEVICE)
+
+
+ def predict(prompt, video_data, temperature):
+     strategy = 'chat'
+
+     video = load_video(video_data, strategy=strategy)
+
+     history = []
+     query = prompt
+     inputs = model.build_conversation_input_ids(
+         tokenizer=tokenizer,
+         query=query,
+         images=[video],
+         history=history,
+         template_version=strategy
+     )
+     inputs = {
+         'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
+         'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
+         'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
+         'images': [[inputs['images'][0].to('cuda').to(TORCH_TYPE)]],
+     }
+     gen_kwargs = {
+         "max_new_tokens": 2048,
+         "pad_token_id": 128002,
+         "top_k": 1,
+         "do_sample": False,
+         "top_p": 0.1,
+         "temperature": temperature,
+     }
+     with torch.no_grad():
+         outputs = model.generate(**inputs, **gen_kwargs)
+         outputs = outputs[:, inputs['input_ids'].shape[1]:]  # keep only the newly generated tokens
+         response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+     return response
+
+
+ def test():
+     prompt = "Please describe this video in detail."
+     temperature = 0.1
+     video_data = open('test.mp4', 'rb').read()
+     response = predict(prompt, video_data, temperature)
+     print(response)
+
+
+ if __name__ == '__main__':
+     test()
+
+ ```
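+
+ The same weights are also published on ModelScope as `ZhipuAI/cogvlm2-llama3-caption`. If you prefer to download them from there rather than from Hugging Face, the snippet below (based on the ModelScope download instructions) resolves a local directory you can use as `MODEL_PATH`; it assumes the `modelscope` package is installed (`pip install modelscope`):
+
+ ```python
+ # Download the weights from ModelScope and point MODEL_PATH at the local copy.
+ from modelscope import snapshot_download
+
+ MODEL_PATH = snapshot_download('ZhipuAI/cogvlm2-llama3-caption')
+ ```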
+
+ ## License
+
+ This model is released under the CogVLM2 [LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0). For models built with Meta Llama 3, please also adhere to the [LLAMA3_LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
+
+ ## Citation
+
+ 🌟 If you find our work helpful, please leave us a star and cite our paper.
+
+ ```
+ @article{yang2024cogvideox,
+   title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+   author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+   journal={arXiv preprint arXiv:2408.06072},
+   year={2024}
+ }
+ ```