pipeline_tag: visual-question-answering
---

# Model Card for InternVL-Chat-V1-2

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k0tma4PhPFrwJvpS_gVQf.webp" alt="Image Description" width="300" height="300">
</p>

[\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)

We are excited to introduce InternVL-Chat-V1-2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.

<p align="center">
<img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
</p>

| Model | Vision Foundation Model | Release Date | Note |
| :---------------------------------------------------------: | :--------------------------------------------------------------------------: | :----------------------: | :---------------------------------- |
| InternVL-Chat-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥new) |
| InternVL-Chat-V1-2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1-1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese; stronger OCR |

| Qwen-VL-Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
| | | | | | | | | | | | | | | |
| LLaVA-NeXT-34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
| InternVL-Chat-V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |

- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.

### Data Preparation

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, using approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.

For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
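The mixture above can be thought of as a small machine-readable manifest. The sketch below is purely illustrative: the dataset names mirror the sources listed in this section, but the annotation paths, repeat factors, and the schema itself are placeholder assumptions — the format the training code actually expects is documented in the linked data-preparation guide.

```python
# Illustrative manifest of the multi-source SFT mixture described above.
# Paths, repeat factors, and the dict schema are hypothetical placeholders.
sft_mixture = {
    "sharegpt4v":  {"annotation": "data/sharegpt4v/train.jsonl",  "repeat_time": 1},
    "llava_zh":    {"annotation": "data/llava_zh/train.jsonl",    "repeat_time": 1},
    "dvqa":        {"annotation": "data/dvqa/train.jsonl",        "repeat_time": 1},
    "chartqa":     {"annotation": "data/chartqa/train.jsonl",     "repeat_time": 1},
    "ai2d":        {"annotation": "data/ai2d/train.jsonl",        "repeat_time": 1},
    "docvqa":      {"annotation": "data/docvqa/train.jsonl",      "repeat_time": 1},
    "geoqa_plus":  {"annotation": "data/geoqa_plus/train.jsonl",  "repeat_time": 1},
    "synthdog_en": {"annotation": "data/synthdog_en/train.jsonl", "repeat_time": 1},
}

def validate_mixture(meta: dict) -> int:
    """Basic sanity checks on the manifest; returns the number of sources."""
    for name, cfg in meta.items():
        assert cfg["annotation"].endswith(".jsonl"), f"bad annotation for {name}"
        assert cfg["repeat_time"] >= 1, f"bad repeat_time for {name}"
    return len(meta)

print(validate_mixture(sft_mixture))  # 8 sources
```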

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1-2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
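As a quick sanity check on these settings, the optimizer-step count implied by one epoch over the roughly 1.2M SFT samples (from the Data Preparation section) at global batch size 512 can be worked out directly:

```python
# Back-of-the-envelope optimizer-step count for the finetuning run above.
# Assumes ~1.2M samples, global batch size 512, and 1 epoch, as stated in this card.
import math

num_samples = 1_200_000
global_batch_size = 512
epochs = 1

optimizer_steps = math.ceil(num_samples * epochs / global_batch_size)
print(optimizer_steps)  # 2344
```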

## Model Usage

We provide example code to run InternVL-Chat-V1-2 using `transformers`.
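A sketch of what such usage can look like is below. The model card's actual example lives in the repository; the custom `chat()` method is loaded via `trust_remote_code`, and the processor choice, 448x448 resize, and generation settings here are assumptions to be checked against the official snippet.

```python
# Hedged sketch of running InternVL-Chat-V1-2 with `transformers`.
# The exact interface is defined by the model's remote code; treat the
# processor class and generation settings below as assumptions.

# Generation settings used by the sketch (a plain dict, inspectable offline).
GENERATION_CONFIG = {"num_beams": 1, "max_new_tokens": 512, "do_sample": False}

def run_chat(image_path: str, question: str) -> str:
    # Heavy imports are deferred so the sketch can be read without
    # downloading the ~40B-parameter checkpoint.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

    path = "OpenGVLab/InternVL-Chat-V1-2"
    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    image_processor = CLIPImageProcessor.from_pretrained(path)

    # The table above lists 448x448 as the model's input resolution.
    image = Image.open(image_path).convert("RGB").resize((448, 448))
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).to(model.device)

    return model.chat(tokenizer, pixel_values, question, GENERATION_CONFIG)
```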

You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.