zwgao committed
Commit 973b6d6 • 1 Parent(s): b83ee00

Update README.md

Files changed (1)
  1. README.md +34 -27
README.md CHANGED
@@ -11,25 +11,47 @@ pipeline_tag: visual-question-answering
 ---
 
 # Model Card for InternVL-Chat-V1.2-Plus
-
-<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/X8AXMkOlKeUpNcoJIXKna.webp" alt="Image Description" width="300" height="300">
+<p align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/X8AXMkOlKeUpNcoJIXKna.webp" alt="Image Description" width="300" height="300">
+</p>
 
 \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Chinese Explanation](https://zhuanlan.zhihu.com/p/675877376)\]
 
-| Model                   | Date       | Download                                                               | Note                             |
-| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
-| InternVL-Chat-V1.5      | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)      | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
-| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) | more SFT data and stronger       |
-| InternVL-Chat-V1.2      | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)      | scaling up LLM to 34B            |
-| InternVL-Chat-V1.1      | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)      | support Chinese and stronger OCR |
-
-## InternVL-Chat-V1.2-Plus Blog
-InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), but the difference lies in the SFT dataset. InternVL-Chat-V1.2 only utilizes an SFT dataset with 1.2M samples, while **our plus version employs an SFT dataset with 12M samples**.
+InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2); the difference lies in the SFT dataset. InternVL-Chat-V1.2 uses an SFT dataset of only 1.2M samples, while **the Plus version employs an SFT dataset with 12M samples**.
 
-<img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
+<p align="center">
+  <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
+</p>
 
-### Performance
+## Model Details
+- **Model Type:** multimodal large language model (MLLM)
+- **Model Stats:**
+  - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
+  - Image size: 448 x 448 (256 tokens)
+  - Params: 40B
+
+- **Training Strategy:**
+  - Pretraining Stage
+    - Learnable Component: MLP
+    - Data: trained on 8192 x 4800 = 39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
+    - Note: in this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we use a pixel shuffle that reduces 1024 tokens to 256.
+  - Supervised Finetuning Stage
+    - Learnable Component: ViT + MLP + LLM
+    - Data: 12 million SFT samples.
+
+## Released Models
+
+| Model | Vision Foundation Model | Release Date | Note |
+| :---: | :---: | :---: | :--- |
+| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
+| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data, stronger performance |
+| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
+| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese; stronger OCR |
+
+## Performance
 
 \* Proprietary Model &nbsp;&nbsp;&nbsp;&nbsp; † Training Set Observed
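The pixel shuffle named in the Model Details block added above is a space-to-depth rearrangement: the 32 x 32 grid of 1024 visual tokens is folded into a 16 x 16 grid of 256 tokens carrying 4x the channels, which the MLP projector then maps into the LLM. A minimal sketch of the idea, assuming PyTorch; the function name, the 3200-channel example width, and the exact reshape order are illustrative, not the repository's implementation:

```python
import torch

def pixel_shuffle_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Merge each 2x2 block of visual tokens into one token (space-to-depth).

    tokens: (batch, 1024, c) from a 32x32 patch grid
    returns: (batch, 256, 4 * c)
    """
    b, n, c = tokens.shape
    h = w = int(n ** 0.5)                         # 1024 tokens -> 32 x 32 grid
    x = tokens.view(b, h, w, c)                   # (b, 32, 32, c)
    x = x.view(b, h // 2, 2, w // 2, 2, c)        # split the grid into 2x2 blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # (b, 16, 16, 2, 2, c)
    return x.view(b, (h // 2) * (w // 2), 4 * c)  # stack each block along channels

# e.g. 1024 ViT tokens with an illustrative 3200-dim width -> 256 tokens, 12800-dim
out = pixel_shuffle_tokens(torch.randn(2, 1024, 3200))
print(out.shape)  # torch.Size([2, 256, 12800])
```

Any fixed space-to-depth ordering works as long as training and inference agree; the point is that the token count drops 4x while per-token width grows 4x, so no information is discarded before the MLP projection.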
@@ -49,21 +71,6 @@ InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2]
 - Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA results have been corrected.
 
-## Model Details
-- **Model Type:** multimodal large language model (MLLM)
-- **Model Stats:**
-  - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
-  - Image size: 448 x 448 (256 tokens)
-  - Params: 40B
-
-- **Training Strategy:**
-  - Pretraining Stage
-    - Learnable Component: MLP
-    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
-    - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-  - Supervised Finetuning Stage
-    - Learnable Component: ViT + MLP + LLM
-    - Data: 12 million SFT samples.
 
 ## Model Usage
 
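The Model Usage section itself is unchanged by this commit and truncated in this view. As a hedged sketch of the common loading pattern for InternVL chat checkpoints: the real preprocessing and the `model.chat` signature come from the checkpoint's `trust_remote_code` modeling files, so treat every call below as an assumption to verify against the repository, and `./example.jpg` is a placeholder path:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2-Plus"

# The modeling code ships with the checkpoint, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Resize to the fixed 448 x 448 input resolution (256 tokens after pixel shuffle).
image = Image.open("./example.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```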