Update README.md
---

# Model Card for InternVL-Chat-V1.5

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
</p>

> _Two interns holding hands, symbolizing the integration of InternViT and InternLM._

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Chinese Interpretation](https://zhuanlan.zhihu.com/p/675877376)\]

We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. It rests on three simple designs:

1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities so that it can be transferred to and reused across different LLMs.
2. Dynamic High-Resolution: we divide images into 1 to 32 tiles of 448$\times$448 pixels according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution.
3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks.
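The tile-selection step in design (2) can be illustrated with a short sketch. This is a hypothetical helper, not the model's actual preprocessing code: it assumes the grid of 448-pixel tiles is simply the one (within the 1-to-32 tile budget) whose aspect ratio best matches the input image.

```python
# Hypothetical sketch of dynamic high-resolution tiling: pick a grid of
# 448x448 tiles (1 to 32 in total) whose aspect ratio best matches the image.
TILE = 448
MAX_TILES = 32

def pick_tile_grid(width: int, height: int) -> tuple[int, int]:
    """Return (cols, rows) minimizing aspect-ratio mismatch within the tile budget."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, MAX_TILES + 1)
        for r in range(1, MAX_TILES + 1)
        if c * r <= MAX_TILES
    ]
    # Closest grid aspect ratio wins; prefer fewer tiles on ties.
    return min(candidates, key=lambda cr: (abs(cr[0] / cr[1] - target), cr[0] * cr[1]))

cols, rows = pick_tile_grid(3840, 2160)  # a 16:9 4K image
print(cols, rows, cols * rows)           # 7 4 28 under this heuristic
```

Under this heuristic a 4K 16:9 image is covered by a 7$\times$4 grid (28 tiles), while a square thumbnail stays at a single tile; the real pipeline would then resize the image to the chosen grid before slicing.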

## Model Details

- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
  - Image size: dynamic resolution, up to 32 tiles of 448 x 448 (4K resolution)
  - Params: 25.5B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: ViT + MLP
    - Data: Please see our technical report.
  - SFT Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: Please see our technical report.

| Model | Vision Foundation Model | Release Date | Note |
| :---: | :---: | :---: | :--- |
| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |
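As a rough illustration of the ViT + MLP + LLM line above, the visual-token budget per image can be sketched as follows. The 14-pixel patch size and the 2$\times$2 pixel-shuffle merge are assumptions taken from the InternVL technical report, not stated in this card, so treat the numbers as indicative only.

```python
# Rough sketch (assumed values): how many visual tokens the LLM sees per image
# under dynamic high-resolution tiling. Assumes 14x14-pixel ViT patches and a
# pixel-shuffle step merging each 2x2 patch group into one token before the MLP.
def visual_tokens(num_tiles: int, tile: int = 448, patch: int = 14,
                  merge: int = 2) -> int:
    patches_per_tile = (tile // patch) ** 2           # 32 * 32 = 1024 patches
    tokens_per_tile = patches_per_tile // merge ** 2  # 1024 / 4 = 256 tokens
    return num_tiles * tokens_per_tile

print(visual_tokens(1))   # smallest image: 256 tokens
print(visual_tokens(32))  # full 32-tile (4K) budget: 8192 tokens
```

Under these assumptions a single 448$\times$448 tile costs 256 tokens, so even the maximal 32-tile input stays within a typical LLM context window.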

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/gqX46Tt5jvrcVqb0vcf06.png)

## Model Usage