---
license: apache-2.0
language:
- en
tags:
- multimodal
library_name: transformers
---

# Introduction
 
The Infinity-VL-2B model is a vision-language model (VLM) trained with the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.

The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. The dataset combines open-source data collected from the internet with synthetic instruction data generated by open-source VLM models.

We plan to open-source the Infinity-MM dataset, training scripts, and related resources.
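The checkpoint is intended for use with the `transformers` library. A minimal inference sketch is shown below; it assumes the released weights follow the LLaVA-OneVision layout that `transformers` supports via `LlavaOnevisionForConditionalGeneration`, and the repository id is a placeholder that may differ from the actual release.

```python
# Hedged loading sketch: assumes a LLaVA-OneVision-style checkpoint that recent
# versions of transformers can load directly; the repository id is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "BAAI/Infinity-VL-2B-llava-qwen"  # placeholder repository id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a single-image chat prompt with the processor's chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```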
  # Evaluation
 
We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) toolkit, an open-source evaluation toolkit for large vision-language models. Whenever possible, we prioritized the GPT-4 API for test sets that support API-based evaluation.
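For reference, single-sample generation with VLMEvalKit follows the pattern below (adapted from the toolkit's generation demo). The registry key `Infinity-VL-2B` is hypothetical and only applies once the model has been registered in `vlmeval.config.supported_VLM`.

```python
# VLMEvalKit generation sketch; the 'Infinity-VL-2B' key is hypothetical and
# works only after the model is registered in vlmeval.config.supported_VLM.
from vlmeval.config import supported_VLM

model = supported_VLM['Infinity-VL-2B']()
response = model.generate(['example.jpg', 'What is shown in this image?'])
print(response)
```

Full benchmark runs go through the toolkit's `run.py` entry point, with `--data` selecting the test sets and `--model` selecting the registered model.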
 
| Test sets | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B |
|:----------------:|:--------------:|:---------------:|:----------------:|:-----------------------:|:-----------------:|
| MMMU\_DEV\_VAL | 39.56 | 34.89 | 43.56 | 41.67 | **45.89** |
| MMStar | 41.6 | 50.2 | 51.87 | 47.8 | **54.4** |
| MathVista\_MINI | 39 | 45 | 47.1 | 47.9 | **57.8** |
| HallusionBench | 36.83 | 38.06 | 36.03 | 41.52 | **42.64** |
| OCRBench | 613 | 784 | 782 | **810** | 776 |
| AI2D\_TEST | 64.8 | 74.38 | 74.22 | **74.64** | 74.38 |
| MMVet | 44.04 | 41.1 | 42.66 | **50.73** | 44.27 |
| DocVQA\_TEST | 71.02 | 86.87 | 87.63 | **89.87** | 76.56 |
| ChartQA\_TEST | 59.64 | 71.4 | 57.08 | 73.52 | **76.56** |
| MMT-Bench\_ALL | 54.46 | 53.31 | **57.24** | 54.78 | 56.19 |
| MathVision | 15.43 | 12.6 | 16.32 | 17.47 | **18.52** |
| OCRVQA\_TESTCORE | 54.43 | 40.23 | 67.64 | **68.68** | 63.83 |
| Average | 52.09 | 57.79 | 60.68 | 61.96 | **62.92** |
 