czczup committed on
Commit c0189e3
1 parent: b6807cb

Upload folder using huggingface_hub
README.md CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: image-text-to-text
 
 ## Introduction
 
-We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of **instruction-tuned models**, ranging from 2 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-Llama3-76B model.
 
 Compared to the state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It demonstrates competitive performance on par with proprietary commercial models across various capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.
 
@@ -29,23 +29,23 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
 | :--------------------------: | :-------------: | :------------: | :-----------: | :------------------: |
 | Model Size | - | - | 40B | 76B |
 | | | | | |
-| DocVQA<sub>test</sub> | 87.2 | 86.5 | 93.9 | |
-| ChartQA<sub>test</sub> | 78.1 | 81.3 | 86.2 | |
-| InfoVQA<sub>test</sub> | - | 72.7 | 78.7 | |
-| TextVQA<sub>val</sub> | - | 73.5 | 83.0 | |
-| OCRBench | 678 | 754 | 837 | |
-| MME<sub>sum</sub> | 2070.2 | 2110.6 | 2315.0 | |
-| RealWorldQA | 68.0 | 67.5 | 71.8 | |
-| AI2D<sub>test</sub> | 89.4 | 80.3 | 87.1 | |
-| MMMU<sub>val</sub> | 63.1 | 58.5 | 53.9 | |
-| MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 86.8 | |
-| MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 86.5 | |
-| CCBench<sub>dev</sub> | 57.3 | 28.4 | 80.6 | |
-| MMVet<sub>GPT-4-0613</sub> | - | - | 68.5 | |
-| MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 65.5 | |
-| SEED-Image | - | - | 78.2 | |
-| HallBench<sub>avg</sub> | 43.9 | 45.6 | 56.9 | |
-| MathVista<sub>testmini</sub> | 58.1 | 57.7 | 63.7 | |
 
 - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
 
@@ -59,9 +59,9 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
 | :------------------: | :----: | :------: | :--------------: | :-----------: | :------------------: |
 | Model Size | - | 34B | 34B | 40B | 76B |
 | | | | | | |
-| MVBench | - | - | - | 72.5 | |
-| Video-MME<br>wo subs | 59.9 | 59.0 | 52.0 | TBD | TBD |
-| Video-MME<br>w/ subs | 63.3 | 59.4 | 54.9 | TBD | TBD |
 
 - We evaluate our models on MVBench by extracting 16 frames from each video, and each frame was resized to a 448x448 image.
 
@@ -71,6 +71,8 @@ Limitations: Although we have made efforts to ensure the safety of the model dur
 
 We provide an example code to run InternVL2-Llama3-76B using `transformers`.
 
 > Please use transformers==4.37.2 to ensure the model works normally.
 
 ```python
@@ -317,6 +319,16 @@ print(f'Assistant: {response}')
 
 ## Deployment
 
 TODO
 
 ## License
@@ -344,7 +356,7 @@ If you find this project useful in your research, please consider citing:
 
 ## 简介 (Introduction)
 
-We are excited to announce the release of InternVL 2.0, the latest version in the InternVL series of multimodal large language models. InternVL 2.0 offers a variety of **instruction-tuned** models, with parameter counts ranging from 2 billion to 108 billion. This repository contains the instruction-tuned InternVL2-Llama3-76B model.
 
 Compared with state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models, demonstrating performance competitive with closed-source commercial models across capabilities including document and chart comprehension, infographic QA, scene-text understanding and OCR, scientific and mathematical problem solving, and cultural understanding and integrated multimodal abilities.
 
@@ -393,8 +405,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
 | Model Size | - | 34B | 34B | 40B | 76B |
 | | | | | | |
 | MVBench | - | - | - | 72.5 | |
-| Video-MME<br>wo subs | 59.9 | 59.0 | 52.0 | TBD | TBD |
-| Video-MME<br>w/ subs | 63.3 | 59.4 | 54.9 | TBD | TBD |
 
 - We evaluate our model on MVBench by extracting 16 frames from each video, with each frame resized to a 448x448 image.
 
@@ -404,6 +416,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
 
 We provide example code for running InternVL2-Llama3-76B with `transformers`.
 
 > Please use transformers==4.37.2 to ensure the model runs normally.
 
 For example code, please [click here](#quick-start).
@@ -414,6 +428,14 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
 
 TODO
 
 ## 开源许可证 (License)
 
 This project is released under the MIT License, while LLama3 is licensed under the Llama 3 Community License.
 
 
 ## Introduction
 
+We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of **instruction-tuned models**, ranging from 1 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-Llama3-76B model.
 
 Compared to the state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It demonstrates competitive performance on par with proprietary commercial models across various capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.
 
 | :--------------------------: | :-------------: | :------------: | :-----------: | :------------------: |
 | Model Size | - | - | 40B | 76B |
 | | | | | |
+| DocVQA<sub>test</sub> | 87.2 | 86.5 | 93.9 | TODO |
+| ChartQA<sub>test</sub> | 78.1 | 81.3 | 86.2 | 88.4 |
+| InfoVQA<sub>test</sub> | - | 72.7 | 78.7 | 82.0 |
+| TextVQA<sub>val</sub> | - | 73.5 | 83.0 | 84.4 |
+| OCRBench | 678 | 754 | 837 | TODO |
+| MME<sub>sum</sub> | 2070.2 | 2110.6 | 2315.0 | 2414.7 |
+| RealWorldQA | 68.0 | 67.5 | 71.8 | TODO |
+| AI2D<sub>test</sub> | 89.4 | 80.3 | 87.1 | 87.6 |
+| MMMU<sub>val</sub> | 63.1 | 58.5 | 53.9 | 55.2 |
+| MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 86.8 | 86.5 |
+| MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 86.5 | 86.3 |
+| CCBench<sub>dev</sub> | 57.3 | 28.4 | 80.6 | 81.0 |
+| MMVet<sub>GPT-4-0613</sub> | - | - | 68.5 | 69.8 |
+| MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 65.5 | TODO |
+| SEED-Image | - | - | 78.2 | 78.2 |
+| HallBench<sub>avg</sub> | 43.9 | 45.6 | 56.9 | TODO |
+| MathVista<sub>testmini</sub> | 58.1 | 57.7 | 63.7 | TODO |
 
 - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
 
 
 | :------------------: | :----: | :------: | :--------------: | :-----------: | :------------------: |
 | Model Size | - | 34B | 34B | 40B | 76B |
 | | | | | | |
+| MVBench | - | - | - | 72.5 | TODO |
+| Video-MME<br>wo subs | 59.9 | 59.0 | 52.0 | TODO | TODO |
+| Video-MME<br>w/ subs | 63.3 | 59.4 | 54.9 | TODO | TODO |
 
  - We evaluate our models on MVBench by extracting 16 frames from each video, and each frame was resized to a 448x448 image.
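The 16-frame sampling described above can be sketched as follows. This is a generic uniform-sampling scheme (one frame from the midpoint of each of 16 equal segments), not necessarily the exact index math used in the evaluation scripts:

```python
def sample_frame_indices(total_frames: int, num_samples: int = 16) -> list:
    """Pick one frame index from the midpoint of each of `num_samples`
    equal-length segments of the video."""
    seg = total_frames / num_samples
    return [min(int(seg * (i + 0.5)), total_frames - 1) for i in range(num_samples)]

# A 320-frame video yields indices 10, 30, ..., 310 (one per 20-frame segment);
# each selected frame would then be resized to 448x448 before encoding.
print(sample_frame_indices(320))
```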
 
 
 
 We provide an example code to run InternVL2-Llama3-76B using `transformers`.
 
+We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/). Currently, due to the limited GPU resources with public IP addresses, we can only deploy models up to a maximum of 26B. We will expand soon and deploy larger models to the online demo.
+
  > Please use transformers==4.37.2 to ensure the model works normally.
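The quick-start code itself is elided in this hunk. One piece of it is dynamic tiling: the input image is cut into 448x448 crops on a grid matched to its aspect ratio. The grid-selection idea can be sketched as below; `best_tile_grid` is a hypothetical helper for illustration, and the repository's actual preprocessing also considers covered area and an extra thumbnail tile:

```python
def best_tile_grid(width: int, height: int, max_tiles: int = 12) -> tuple:
    """Choose a (cols, rows) grid of 448x448 tiles whose aspect ratio is
    closest to the input image's, using at most `max_tiles` tiles."""
    target = width / height
    candidates = [(c, r)
                  for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1)
                  if c * r <= max_tiles]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

# A 2:1 panorama maps to a 2x1 grid of 448x448 tiles.
print(best_tile_grid(896, 448))
```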
 
 ```python
 
 ## Deployment
 
+### LMDeploy
+
+TODO
+
+### vLLM
+
+TODO
+
+### Ollama
+
 TODO
 
 ## License
 
 
 ## 简介 (Introduction)
 
+We are excited to announce the release of InternVL 2.0, the latest version in the InternVL series of multimodal large language models. InternVL 2.0 offers a variety of **instruction-tuned** models, with parameter counts ranging from 1 billion to 108 billion. This repository contains the instruction-tuned InternVL2-Llama3-76B model.
 
 Compared with state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models, demonstrating performance competitive with closed-source commercial models across capabilities including document and chart comprehension, infographic QA, scene-text understanding and OCR, scientific and mathematical problem solving, and cultural understanding and integrated multimodal abilities.
 
 | Model Size | - | 34B | 34B | 40B | 76B |
 | | | | | | |
 | MVBench | - | - | - | 72.5 | |
+| Video-MME<br>wo subs | 59.9 | 59.0 | 52.0 | TODO | TODO |
+| Video-MME<br>w/ subs | 63.3 | 59.4 | 54.9 | TODO | TODO |
 
 - We evaluate our model on MVBench by extracting 16 frames from each video, with each frame resized to a 448x448 image.
 
 
 
 We provide example code for running InternVL2-Llama3-76B with `transformers`.
 
+We also welcome you to try the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/). Currently, due to limited GPU resources with public IP addresses, we can only deploy models up to 26B. We will scale up soon and deploy larger models to the online demo; stay tuned.
+
 > Please use transformers==4.37.2 to ensure the model runs normally.
 
 For example code, please [click here](#quick-start).
 
 
 TODO
 
+### vLLM
+
+TODO
+
+### Ollama
+
+TODO
+
 ## 开源许可证 (License)
 
 This project is released under the MIT License, while LLama3 is licensed under the Llama 3 Community License.
configuration_intern_vit.py CHANGED
@@ -1,6 +1,6 @@
 # --------------------------------------------------------
 # InternVL
-# Copyright (c) 2023 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
 import os
 
 # --------------------------------------------------------
 # InternVL
+# Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
 import os
configuration_internvl_chat.py CHANGED
@@ -1,6 +1,6 @@
 # --------------------------------------------------------
 # InternVL
-# Copyright (c) 2023 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
 
 
 # --------------------------------------------------------
 # InternVL
+# Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
 
conversation.py CHANGED
@@ -330,13 +330,16 @@ def get_conv_template(name: str) -> Conversation:
     return conv_templates[name].copy()
 
 
-# Note that for inference, using the Hermes-2 and internlm2-chat templates is equivalent.
 register_conv_template(
     Conversation(
         name='Hermes-2',
         system_template='<|im_start|>system\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
-        # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室及多家合作单位联合开发的多模态大语言模型。人工智能实验室致力于原始技术创新,开源开放,共享共创,推动科技进步和产业发展。',
         system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
         sep_style=SeparatorStyle.MPT,
@@ -357,7 +360,7 @@ register_conv_template(
         name='internlm2-chat',
         system_template='<|im_start|>system\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
-        # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室及多家合作单位联合开发的多模态大语言模型。人工智能实验室致力于原始技术创新,开源开放,共享共创,推动科技进步和产业发展。',
         system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
         sep_style=SeparatorStyle.MPT,
@@ -376,7 +379,7 @@ register_conv_template(
         name='phi3-chat',
         system_template='<|system|>\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
-        # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室及多家合作单位联合开发的多模态大语言模型。人工智能实验室致力于原始技术创新,开源开放,共享共创,推动科技进步和产业发展。',
         system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|user|>\n', '<|assistant|>\n'),
         sep_style=SeparatorStyle.MPT,
 
     return conv_templates[name].copy()
 
 
+# Both Hermes-2 and internlm2-chat are chatml-format conversation templates. The difference
+# is that during training, the preprocessing function for the Hermes-2 template doesn't add
+# <s> at the beginning of the tokenized sequence, while the internlm2-chat template does.
+# Therefore, they are completely equivalent during inference.
 register_conv_template(
     Conversation(
         name='Hermes-2',
         system_template='<|im_start|>system\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
+        # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室及多家合作单位联合开发的多模态大语言模型。',
         system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
         sep_style=SeparatorStyle.MPT,
 
         name='internlm2-chat',
         system_template='<|im_start|>system\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
+        # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室及多家合作单位联合开发的多模态大语言模型。',
         system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
         sep_style=SeparatorStyle.MPT,
 
         name='phi3-chat',
         system_template='<|system|>\n{system_message}',
         # note: The new system prompt was not used here to avoid changes in benchmark performance.
+        # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室及多家合作单位联合开发的多模态大语言模型。',
         system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
         roles=('<|user|>\n', '<|assistant|>\n'),
         sep_style=SeparatorStyle.MPT,
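The templates registered above all follow the ChatML layout: a system block, then alternating role turns. A minimal render of a single-turn prompt, assuming `<|im_end|>` as the turn separator (the `sep` fields are elided from this hunk, so the separator is an assumption), looks like:

```python
def render_chatml(system_message: str, user_message: str) -> str:
    # Assemble a single-turn ChatML prompt the way the Hermes-2 /
    # internlm2-chat templates would: system block, user turn, then an
    # open assistant turn for the model to complete.
    sep = '<|im_end|>\n'  # assumed separator; not shown in this hunk
    return (f'<|im_start|>system\n{system_message}{sep}'
            f'<|im_start|>user\n{user_message}{sep}'
            f'<|im_start|>assistant\n')

prompt = render_chatml('You are a helpful assistant.', 'Describe this image.')
print(prompt)
```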
modeling_intern_vit.py CHANGED
@@ -1,6 +1,6 @@
 # --------------------------------------------------------
 # InternVL
-# Copyright (c) 2023 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
 from typing import Optional, Tuple, Union
 
 # --------------------------------------------------------
 # InternVL
+# Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
 from typing import Optional, Tuple, Union
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
+{
+  "crop_size": 448,
+  "do_center_crop": true,
+  "do_normalize": true,
+  "do_resize": true,
+  "feature_extractor_type": "CLIPFeatureExtractor",
+  "image_mean": [
+    0.485,
+    0.456,
+    0.406
+  ],
+  "image_std": [
+    0.229,
+    0.224,
+    0.225
+  ],
+  "resample": 3,
+  "size": 448
+}
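The configuration above describes CLIP-style preprocessing: resize and center-crop to 448x448 with bicubic resampling (`"resample": 3` is PIL's BICUBIC constant), then per-channel normalization with the ImageNet mean and std. The normalization step alone can be sketched in plain Python:

```python
# Normalization constants from preprocessor_config.json (ImageNet mean/std).
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale an 8-bit (R, G, B) triple to [0, 1], then apply the per-channel
    normalization from preprocessor_config.json."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD))

# Mid-gray (128, 128, 128) maps to small positive values, since 128/255
# sits just above each channel mean.
print(normalize_pixel((128, 128, 128)))
```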