czczup committed
Commit de2e232
Parent: 15083a1

Update README.md

Files changed (1)
  1. README.md +21 -14
README.md CHANGED
@@ -10,12 +10,22 @@ datasets:
 pipeline_tag: visual-question-answering
 ---
 
-# Model Card for InternVL-Chat-Chinese-V1.2-Plus
+# Model Card for InternVL-Chat-V1.2-Plus
 
-\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
+<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/X8AXMkOlKeUpNcoJIXKna.webp" alt="Image Description" width="300" height="300">
 
+\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Chinese Interpretation](https://zhuanlan.zhihu.com/p/675877376)\]
 
-InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2), but the difference lies in the SFT dataset. InternVL-Chat-V1.2 only utilizes an SFT dataset with 1.2M samples, while **our plus version employs an SFT dataset with 12M samples**.
+| Model                   | Date       | Download                                                               | Note                                                                                                                                                          |
+| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| InternVL-Chat-V1.5      | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)      | supports 4K images; super-strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
+| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) | more SFT data and stronger performance                                                                                                                        |
+| InternVL-Chat-V1.2      | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)      | scales the LLM up to 34B                                                                                                                                      |
+| InternVL-Chat-V1.1      | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)      | supports Chinese and stronger OCR                                                                                                                             |
+
+
+## InternVL-Chat-V1.2-Plus Blog
+InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), but the difference lies in the SFT dataset: InternVL-Chat-V1.2 uses an SFT dataset with only 1.2M samples, while **the Plus version is trained on an SFT dataset with 12M samples**.
 
 <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
 
@@ -32,46 +42,43 @@ InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2]
 | Qwen−VL−Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
 | | | | | | | | | | | | | | | |
 | LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1† |
-| InternVL−Chat−V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1672/509 | 83.3 | 88.0 | 69.7 | 75.6 | 60.0 | 64.0† |
-| InternVL−Chat−V1.2−Plus | 448x448 | 50.3 | 45.6 | 59.9 | 83.8 | 82.0 | 58.7 | 1624/551 | 98.1† | 88.7 | 71.3† | 76.4 | - | 66.9† |
+| InternVL−Chat−V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0† |
+| InternVL−Chat−V1.2−Plus | 448x448 | 50.3 | 45.6 | 59.9 | 83.8 | 82.0 | 58.7 | 1625/553 | 98.1† | 88.7 | 74.1† | 76.4 | - | 66.9† |
 
 - MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
+- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA results have been corrected.
 
 
 ## Model Details
-- **Model Type:** vision large language model, multimodal chatbot
+- **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
   - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
+  - Image size: 448 x 448 (256 tokens)
   - Params: 40B
-  - Image size: 448 x 448
-  - Number of visual tokens: 256
 
 - **Training Strategy:**
   - Pretraining Stage
     - Learnable Component: MLP
     - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
     - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we use a pixel shuffle that merges the 1024 tokens into 256 tokens (see the sketch after the diff).
-  - SFT Stage
+  - Supervised Finetuning Stage
     - Learnable Component: ViT + MLP + LLM
     - Data: 12 million SFT samples.
 
 
 ## Model Usage
 
-We provide a minimum code example to run InternVL-Chat using only the `transformers` library.
+We provide example code to run InternVL-Chat-V1.2-Plus using `transformers`.
 
 You can also use our [online demo](https://internvl.opengvlab.com/) for a quick hands-on experience with this model.
 
-Note: If you meet this error `ImportError: This modeling file requires the following packages that were not found in your environment: fastchat`, please run `pip install fschat`.
-
-
 ```python
 import torch
 from PIL import Image
 from transformers import AutoModel, CLIPImageProcessor
 from transformers import AutoTokenizer
 
-path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
+path = "OpenGVLab/InternVL-Chat-V1-2-Plus"
 # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
 model = AutoModel.from_pretrained(
     path,
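
The pretraining note in the hunk above says that a pixel shuffle reduces the 1024 visual tokens to 256. Below is a minimal sketch of what such a token-reduction step can look like; the function name, the scale factor of 2, and the 3200-dim token width are illustrative assumptions, not code taken from the InternVL repository.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge each scale x scale neighbourhood of the square token grid into one
    token, trading spatial tokens for a wider channel dimension.

    x: (batch, num_tokens, channels), num_tokens must be a perfect square.
    returns: (batch, num_tokens // scale**2, channels * scale**2)
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                      # 1024 tokens -> 32 x 32 grid
    x = x.view(b, h, w, c)
    # split the grid into (h/scale, w/scale) blocks of scale x scale tokens
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    # move each scale x scale block next to the channel axis and fold it in
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // scale) * (w // scale), c * scale * scale)
    return x

vit_tokens = torch.randn(2, 1024, 3200)   # assumed ViT output: 1024 tokens, 3200 channels
merged = pixel_shuffle_tokens(vit_tokens) # the MLP projector would then run on these tokens
print(merged.shape)                       # torch.Size([2, 256, 12800])
```

Applied to the 32 x 32 grid implied by 1024 tokens, a scale of 2 yields the 16 x 16 = 256 tokens mentioned in the note, with the channel dimension growing 4x before the MLP projector.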
 
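The usage snippet above is cut off at the hunk boundary. For orientation, here is a sketch of how such an example typically continues; the preprocessing steps, the example image path, the generation settings, and the `model.chat` call (provided by the repository's `trust_remote_code` modeling file) are assumptions for illustration rather than lines taken from this diff.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2-Plus"

# Load the 40B model in bfloat16; an 80G A100 can hold it on a single GPU.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Resize the input image to the model's 448 x 448 resolution and preprocess it.
image = Image.open("example.jpg").convert("RGB").resize((448, 448))  # placeholder image path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)

# `chat` is defined by the model's remote code; the exact signature may differ.
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

If a single 80G GPU is not available, passing `device_map="auto"` to `from_pretrained` (with `accelerate` installed) is a common way to shard the model across several smaller GPUs.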