BAAI
license: apache-2.0
language:
  - en
tags:
  - multimodal
library_name: transformers

Introduction

Infinity-VL-2B is a vision-language model (VLM) trained with the LLaVA-OneVision framework. Qwen2.5-1.5B-Instruct was chosen as the LLM, and siglip-so400m-patch14-384 serves as the vision tower.
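As a rough illustration of what the vision tower hands to the LLM, the sketch below counts the patch tokens produced for a single square crop. The image size (384) and patch size (14) are read off the checkpoint name `siglip-so400m-patch14-384`; the helper `patch_tokens` is a hypothetical name introduced here, not part of any library. This is only a per-crop count: how many crops the training framework feeds per image is not stated in this card, so the total number of visual tokens per image may be larger.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Count non-overlapping patches a ViT-style encoder sees for one square crop."""
    # Integer division: pixels that do not fill a whole patch at the border are dropped.
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: both values are taken from the checkpoint name "siglip-so400m-patch14-384".
tokens_per_crop = patch_tokens(image_size=384, patch_size=14)
print(tokens_per_crop)  # 729 (a 27 x 27 grid)
```

Each of these patch embeddings is then mapped into the LLM's embedding space before being interleaved with the text tokens.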

The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. The dataset combines open-source data collected from the internet with synthetic instruction data generated using open-source VLMs.

We plan to open-source the Infinity-MM dataset, training scripts, and related resources in the near future. For more technical details, stay tuned for our upcoming technical report.

Evaluation

We evaluated the model using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), an open-source evaluation toolkit for large vision-language models. For test sets that support API-based evaluation, we used the GPT-4 API whenever possible.

| Test set | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Infinity-VL-2B |
|---|---|---|---|---|---|
| MMMU_DEV_VAL | 39.56 | 34.89 | 43.56 | 41.67 | 45.89 |
| MMStar | 41.6 | 50.2 | 51.87 | 47.8 | 54.4 |
| MMBench_V11 | 65.2 | 69.72 | 75.41 | 72.7 | 72.63 |
| MathVista_MINI | 39 | 45 | 47.1 | 47.9 | 57.8 |
| HallusionBench | 36.83 | 38.06 | 36.03 | 41.52 | 42.64 |
| OCRBench | 613 | 784 | 782 | 810 | 776 |
| AI2D_TEST | 64.8 | 74.38 | 74.22 | 74.64 | 74.38 |
| MMVet | 44.04 | 41.1 | 42.66 | 50.73 | 44.27 |
| DocVQA_TEST | 71.02 | 86.87 | 87.63 | 89.87 | 76.56 |
| ChartQA_TEST | 59.64 | 71.4 | 57.08 | 73.52 | 76.56 |
| TextVQA_VAL | 74.3 | 73.49 | 77.61 | 79.9 | 76.13 |
| VCR_EN_EASY_ALL | 27.61 | 51.59 | 67.71 | 68.26 | 73.33 |
| RealWorldQA | 55.42 | 57.25 | 63.92 | 62.61 | 64.71 |
| MMBench_TEST_EN | 69.39 | 73.37 | 78.87 | 74.94 | 77.75 |
| MMBench_TEST_CN | 65.86 | 70.85 | 76.12 | 73.93 | 72.25 |
| MMT-Bench_ALL | 54.46 | 53.31 | 57.24 | 54.78 | 56.19 |
| MathVision | 15.43 | 12.6 | 16.32 | 17.47 | 18.52 |
| OCRVQA_TESTCORE | 54.43 | 40.23 | 67.64 | 68.68 | 63.83 |
| Average | 52.09 | 57.79 | 60.68 | 61.96 | 62.92 |

All comparison models were evaluated in our local environment, so their scores may differ slightly from those reported in the original papers or on the official VLMEvalKit leaderboard.
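Note that the rows in the table above are not all on the same scale: OCRBench is a raw score out of 1000, while the other rows read as percentages. This card does not state how the Average row was computed, so the following is only a sketch of one plausible way to put scores on a common 0-100 scale before averaging. The names `ASSUMED_MAX_SCORE`, `to_percent`, and `average_percent` are hypothetical, and the result is not guaranteed to reproduce the Average row.

```python
from statistics import mean

# Assumption: maximum attainable score per benchmark.
# Any benchmark not listed here is treated as already being on a 0-100 scale.
ASSUMED_MAX_SCORE = {"OCRBench": 1000.0}


def to_percent(benchmark: str, score: float) -> float:
    """Rescale a raw score to 0-100 using the assumed maximum for that benchmark."""
    return 100.0 * score / ASSUMED_MAX_SCORE.get(benchmark, 100.0)


def average_percent(scores: dict[str, float]) -> float:
    """Unweighted mean of the rescaled scores."""
    return mean(to_percent(name, value) for name, value in scores.items())


# Two Infinity-VL-2B entries copied from the table, for illustration only.
subset = {"MMStar": 54.4, "OCRBench": 776}
print(round(average_percent(subset), 2))
```

An unweighted mean treats every test set equally regardless of its size; a different weighting would give a different summary number.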

Future Plan

  • We plan to train models of various sizes.
  • Future training will incorporate multi-image and video data.
  • We will open-source the Infinity-MM dataset and training code.
  • A comprehensive technical report will be released.

Disclaimer

The resources associated with this project, including code, data, and model weights, are restricted to academic research and may not be used for commercial purposes. The content produced by the model is influenced by uncontrollable variables such as randomness, so this project cannot guarantee the accuracy of its output. This project accepts no legal liability for the content of the model output and assumes no responsibility for any losses incurred through the use of the associated resources or output results.