---
license: apache-2.0
language:
- en
tags:
- multimodal
library_name: transformers
---

# Introduction
 
The Infinity-VL-2B model is a vision-language model (VLM) trained with the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.

The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. The dataset combines open-source data collected from the internet with synthetic instruction data generated by open-source VLM models.

We plan to open-source the Infinity-MM dataset, training scripts, and related resources.
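The checkpoint is intended for use with the `transformers` library. A minimal inference sketch is shown below; it assumes the released weights follow the LLaVA-OneVision layout that `transformers` supports via `LlavaOnevisionForConditionalGeneration`, and the repository id is a placeholder that may differ from the actual release.

```python
# Hedged loading sketch: assumes a LLaVA-OneVision-style checkpoint that recent
# versions of transformers can load directly; the repository id is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "BAAI/Infinity-VL-2B-llava-qwen"  # placeholder repository id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a single-image chat prompt with the processor's chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```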
  # Evaluation
 
We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) toolkit, an open-source evaluation toolkit for large vision-language models. Whenever possible, we prioritized the GPT-4 API for test sets that support API-based evaluation.
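For reference, single-sample generation with VLMEvalKit follows the pattern below (adapted from the toolkit's generation demo). The registry key `Infinity-VL-2B` is hypothetical and only applies once the model has been registered in `vlmeval.config.supported_VLM`.

```python
# VLMEvalKit generation sketch; the 'Infinity-VL-2B' key is hypothetical and
# works only after the model is registered in vlmeval.config.supported_VLM.
from vlmeval.config import supported_VLM

model = supported_VLM['Infinity-VL-2B']()
response = model.generate(['example.jpg', 'What is shown in this image?'])
print(response)
```

Full benchmark runs go through the toolkit's `run.py` entry point, with `--data` selecting the test sets and `--model` selecting the registered model.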
 
| Test sets | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B |
|:----------------:|:--------------:|:---------------:|:----------------:|:-----------------------:|:-----------------:|
| MMMU\_DEV\_VAL | 39.56 | 34.89 | 43.56 | 41.67 | **45.89** |
| MMStar | 41.6 | 50.2 | 51.87 | 47.8 | **54.4** |
| MathVista\_MINI | 39 | 45 | 47.1 | 47.9 | **57.8** |
| HallusionBench | 36.83 | 38.06 | 36.03 | 41.52 | **42.64** |
| OCRBench | 613 | 784 | 782 | **810** | 776 |
| AI2D\_TEST | 64.8 | 74.38 | 74.22 | **74.64** | 74.38 |
| MMVet | 44.04 | 41.1 | 42.66 | **50.73** | 44.27 |
| DocVQA\_TEST | 71.02 | 86.87 | 87.63 | **89.87** | 76.56 |
| ChartQA\_TEST | 59.64 | 71.4 | 57.08 | 73.52 | **76.56** |
| MMT-Bench\_ALL | 54.46 | 53.31 | **57.24** | 54.78 | 56.19 |
| MathVision | 15.43 | 12.6 | 16.32 | 17.47 | **18.52** |
| OCRVQA\_TESTCORE | 54.43 | 40.23 | 67.64 | **68.68** | 63.83 |
| Average | 52.09 | 57.79 | 60.68 | 61.96 | **62.92** |
 