pipeline_tag: visual-question-answering
---

# Model Card for InternVL-Chat-V1.2-Plus

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/X8AXMkOlKeUpNcoJIXKna.webp" alt="Image Description" width="300" height="300">
</p>

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]

InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), but the difference lies in the SFT dataset: InternVL-Chat-V1.2 only utilizes an SFT dataset with 1.2M samples, while **our Plus version employs an SFT dataset with 12M samples**.

<p align="center">
  <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
</p>
## Model Details

- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
  - Image size: 448 x 448 (256 tokens)
  - Params: 40B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: MLP
    - Data: Trained on 8192 x 4800 = 39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
    - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we apply a pixel shuffle that merges the 1024 tokens into 256.
  - Supervised Finetuning Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: 12 million SFT samples.
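The pixel shuffle mentioned above is a space-to-depth reshape: each 2 x 2 neighborhood of the square visual-token grid is folded into one token with 4x the channel width, turning 1024 tokens (a 32 x 32 grid) into 256. The sketch below is illustrative only; the function name and channel width are made up here, not taken from the InternVL codebase:

```python
import numpy as np

def pixel_shuffle_tokens(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold each r x r neighborhood of a square token grid into one token,
    multiplying the channel dimension by r * r (space-to-depth)."""
    b, n, c = x.shape
    side = int(round(n ** 0.5))               # assume a square token grid
    x = x.reshape(b, side // r, r, side // r, r, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)         # group each r x r block together
    return x.reshape(b, (side // r) ** 2, r * r * c)

# 1024 visual tokens (a 32 x 32 grid) with an illustrative channel width of 32.
tokens = np.random.randn(1, 1024, 32).astype(np.float32)
merged = pixel_shuffle_tokens(tokens)
print(merged.shape)  # (1, 256, 128)
```

Because the merge is a pure reshape, no information is discarded; the wider channels are simply handed to the subsequent MLP projector.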
## Released Models

| Model | Vision Foundation Model | Release Date | Note |
| :---------------------: | :---------------------: | :----------: | :--- |
| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | Supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | More SFT data and stronger performance |
| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | Scales the LLM up to 34B |
| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | Supports Chinese and stronger OCR |
## Performance

\* Proprietary Model &nbsp;&nbsp; † Training Set Observed

- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA results have been corrected.
## Model Usage
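The card above states that the model consumes 448 x 448 images. Below is a minimal, dependency-free sketch of the image-to-tensor step; the official repository uses torchvision transforms, and the ImageNet mean/std constants are the conventional choice, assumed here rather than taken from this card:

```python
import numpy as np

# Conventional ImageNet statistics (an assumption; check the official preprocessing code).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_pixel_values(image: np.ndarray, size: int = 448) -> np.ndarray:
    """Resize an (H, W, 3) uint8 image to (3, size, size) float32,
    scaled to [0, 1] and normalized channel-wise."""
    h, w, _ = image.shape
    # Nearest-neighbor resize, just to keep the sketch dependency-free.
    rows = (np.arange(size) * h // size).clip(0, h - 1)
    cols = (np.arange(size) * w // size).clip(0, w - 1)
    resized = image[rows[:, None], cols[None, :]].astype(np.float32) / 255.0
    normalized = (resized - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)  # HWC -> CHW

frame = np.zeros((600, 800, 3), dtype=np.uint8)  # placeholder image
pixel_values = to_pixel_values(frame)
print(pixel_values.shape)  # (3, 448, 448)
```

In practice the resulting `pixel_values` (batched and cast to the model's dtype) would be passed to the model's chat/generate entry point; see the rest of this section for the exact API.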