Update README.md
---

# Model Card for InternVL-Chat-V1.5

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
</p>

> _Two interns holding hands, symbolizing the integration of InternViT and InternLM._

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Chinese Interpretation](https://zhuanlan.zhihu.com/p/675877376)\]

We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. It rests on three simple designs:

1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities so that it can be transferred to and reused across different LLMs.
2. Dynamic High-Resolution: we divide images into 1 to 32 tiles of 448$\times$448 pixels according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution.
3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks.
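The tile-selection step in design (2) can be illustrated with a short sketch. This is a hypothetical helper, not the model's actual preprocessing code: it assumes the grid of 448-pixel tiles is simply the one (within the 1-to-32 tile budget) whose aspect ratio best matches the input image.

```python
# Hypothetical sketch of dynamic high-resolution tiling: pick a grid of
# 448x448 tiles (1 to 32 in total) whose aspect ratio best matches the image.
TILE = 448
MAX_TILES = 32

def pick_tile_grid(width: int, height: int) -> tuple[int, int]:
    """Return (cols, rows) minimizing aspect-ratio mismatch within the tile budget."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, MAX_TILES + 1)
        for r in range(1, MAX_TILES + 1)
        if c * r <= MAX_TILES
    ]
    # Closest grid aspect ratio wins; prefer fewer tiles on ties.
    return min(candidates, key=lambda cr: (abs(cr[0] / cr[1] - target), cr[0] * cr[1]))

cols, rows = pick_tile_grid(3840, 2160)  # a 16:9 4K image
print(cols, rows, cols * rows)           # 7 4 28 under this heuristic
```

Under this heuristic a 4K 16:9 image is covered by a 7$\times$4 grid (28 tiles), while a square thumbnail stays at a single tile; the real pipeline would then resize the image to the chosen grid before slicing.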

## Model Details

- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
  - Image size: dynamic resolution, up to 32 tiles of 448 x 448 (4K resolution)
  - Params: 25.5B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: ViT + MLP
    - Data: Please see our technical report.
  - SFT Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: Please see our technical report.

| Model | Vision Foundation Model | Release Date | Note |
| :---: | :---: | :---: | :--- |
| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |
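As a rough illustration of the ViT + MLP + LLM line above, the visual-token budget per image can be sketched as follows. The 14-pixel patch size and the 2$\times$2 pixel-shuffle merge are assumptions taken from the InternVL technical report, not stated in this card, so treat the numbers as indicative only.

```python
# Rough sketch (assumed values): how many visual tokens the LLM sees per image
# under dynamic high-resolution tiling. Assumes 14x14-pixel ViT patches and a
# pixel-shuffle step merging each 2x2 patch group into one token before the MLP.
def visual_tokens(num_tiles: int, tile: int = 448, patch: int = 14,
                  merge: int = 2) -> int:
    patches_per_tile = (tile // patch) ** 2           # 32 * 32 = 1024 patches
    tokens_per_tile = patches_per_tile // merge ** 2  # 1024 / 4 = 256 tokens
    return num_tiles * tokens_per_tile

print(visual_tokens(1))   # smallest image: 256 tokens
print(visual_tokens(32))  # full 32-tile (4K) budget: 8192 tokens
```

Under these assumptions a single 448$\times$448 tile costs 256 tokens, so even the maximal 32-tile input stays within a typical LLM context window.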

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/gqX46Tt5jvrcVqb0vcf06.png)

## Model Usage