Update README.md
README.md CHANGED
@@ -1,6 +1,6 @@
# Introduction

-The Infinity-VL-2B model is a vision-language model (VLM) trained using the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.
+The Infinity-VL-2B-llava-qwen model is a vision-language model (VLM) trained using the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.

The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. This dataset is a combination of open-source data collected from the internet and synthetic instruction data generated using open-source VLM models.

@@ -10,7 +10,7 @@ We plan to open-source the Infinity-MM dataset, training scripts, and related re

We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) open-source evaluation toolkit for large vision-language models (LVLMs). Whenever possible, we prioritized using the GPT-4 API for test sets that support API-based evaluation.

-| Test sets | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B |
+| Test sets | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B-llava-qwen |
|:----------------:|:--------------:|:---------------:|:----------------:|:-----------------------:|:-----------------:|
| MMMU\_DEV\_VAL | 39.56 | 34.89 | 43.56 | 41.67 | **45.89** |
| MMStar | 41.6 | 50.2 | 51.87 | 47.8 | **54.4** |
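
As a concrete illustration of the composition described in the introduction above, the sketch below wires the two published components together in a LLaVA-OneVision-style layout: the SigLIP vision tower produces patch features, and a small projector maps them into the embedding space of the Qwen2.5 LLM. This is a minimal sketch of the architecture only, not the project's training or inference code; the two-layer MLP projector and the use of `SiglipVisionModel`/`AutoModelForCausalLM` from `transformers` are our assumptions.

```python
# Minimal, illustrative sketch of a LLaVA-OneVision-style composition
# (assumed layout; the released checkpoint may organize these modules differently).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

# The two building blocks named in the introduction.
vision_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
language_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Two-layer MLP projector mapping vision features into the LLM embedding space
# (projector shape is an assumption on our part).
projector = nn.Sequential(
    nn.Linear(vision_tower.config.hidden_size, language_model.config.hidden_size),
    nn.GELU(),
    nn.Linear(language_model.config.hidden_size, language_model.config.hidden_size),
)

# A dummy 384x384 image: its patch features become "visual tokens".
with torch.no_grad():
    pixel_values = torch.randn(1, 3, 384, 384)
    patch_features = vision_tower(pixel_values).last_hidden_state  # (batch, num_patches, 1152)
    visual_tokens = projector(patch_features)                      # (batch, num_patches, 1536)
print(visual_tokens.shape)
```

In a LLaVA-style pipeline these visual tokens are interleaved with the text token embeddings before the LLM forward pass.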
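
The evaluation setup described above can, in principle, be reproduced through VLMEvalKit's standard `run.py` entry point. The snippet below is a rough sketch under two assumptions: the model is registered in VLMEvalKit under the placeholder name `Infinity-VL-2B-llava-qwen`, and an `OPENAI_API_KEY` is available so GPT-4 can act as the judge on test sets that support API-based scoring.

```python
# Rough reproduction sketch using VLMEvalKit's command-line entry point.
# Run from the root of a VLMEvalKit checkout; an OPENAI_API_KEY must be set in the
# environment for test sets that are judged with the GPT-4 API.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "MMMU_DEV_VAL", "MMStar",      # benchmark names as used in the table above
        "--model", "Infinity-VL-2B-llava-qwen",  # placeholder; must match the name the model
                                                 # is registered under in VLMEvalKit (assumption)
        "--verbose",
    ],
    check=True,
)
```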