Update README.md
README.md CHANGED
@@ -1,6 +1,6 @@
# Introduction

-The Infinity-VL-2B model is a vision-language model (VLM) trained using the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.
+The Infinity-VL-2B-llava-qwen model is a vision-language model (VLM) trained using the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.

The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. This dataset is a combination of open-source data collected from the internet and synthetic instruction data generated using open-source VLM models.

@@ -10,7 +10,7 @@ We plan to open-source the Infinity-MM dataset, training scripts, and related re

We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) open-source evaluation toolkit for large vision-language models (LVLMs). Whenever possible, we prioritized using the GPT-4 API for test sets that support API-based evaluation.

-| Test sets | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B |
+| Test sets | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B-llava-qwen |
|:----------------:|:--------------:|:---------------:|:----------------:|:-----------------------:|:-----------------:|
| MMMU\_DEV\_VAL | 39.56 | 34.89 | 43.56 | 41.67 | **45.89** |
| MMStar | 41.6 | 50.2 | 51.87 | 47.8 | **54.4** |
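
As a concrete illustration of the composition described in the introduction above, the sketch below wires the two published components together in a LLaVA-OneVision-style layout: the SigLIP vision tower produces patch features, and a small projector maps them into the embedding space of the Qwen2.5 LLM. This is a minimal sketch of the architecture only, not the project's training or inference code; the two-layer MLP projector and the use of `SiglipVisionModel`/`AutoModelForCausalLM` from `transformers` are our assumptions.

```python
# Minimal, illustrative sketch of a LLaVA-OneVision-style composition
# (assumed layout; the released checkpoint may organize these modules differently).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

# The two building blocks named in the introduction.
vision_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
language_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Two-layer MLP projector mapping vision features into the LLM embedding space
# (projector shape is an assumption on our part).
projector = nn.Sequential(
    nn.Linear(vision_tower.config.hidden_size, language_model.config.hidden_size),
    nn.GELU(),
    nn.Linear(language_model.config.hidden_size, language_model.config.hidden_size),
)

# A dummy 384x384 image: its patch features become "visual tokens".
with torch.no_grad():
    pixel_values = torch.randn(1, 3, 384, 384)
    patch_features = vision_tower(pixel_values).last_hidden_state  # (batch, num_patches, 1152)
    visual_tokens = projector(patch_features)                      # (batch, num_patches, 1536)
print(visual_tokens.shape)
```

In a LLaVA-style pipeline these visual tokens are interleaved with the text token embeddings before the LLM forward pass.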
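
The evaluation setup described above can, in principle, be reproduced through VLMEvalKit's standard `run.py` entry point. The snippet below is a rough sketch under two assumptions: the model is registered in VLMEvalKit under the placeholder name `Infinity-VL-2B-llava-qwen`, and an `OPENAI_API_KEY` is available so GPT-4 can act as the judge on test sets that support API-based scoring.

```python
# Rough reproduction sketch using VLMEvalKit's command-line entry point.
# Run from the root of a VLMEvalKit checkout; an OPENAI_API_KEY must be set in the
# environment for test sets that are judged with the GPT-4 API.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "MMMU_DEV_VAL", "MMStar",      # benchmark names as used in the table above
        "--model", "Infinity-VL-2B-llava-qwen",  # placeholder; must match the name the model
                                                 # is registered under in VLMEvalKit (assumption)
        "--verbose",
    ],
    check=True,
)
```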