license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
Model
We used the same Vision Transformer architecture ViT-L/14@336px as CLIP.
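As a rough illustration of what ViT-L/14@336px implies for sequence length, the sketch below counts the tokens the encoder produces for one image. It assumes the standard ViT patch embedding with non-overlapping square patches and a single class token, as in CLIP; the helper name `vit_sequence_length` is hypothetical and not part of any library.

```python
def vit_sequence_length(image_size: int, patch_size: int, has_class_token: bool = True) -> int:
    """Number of tokens a ViT produces for a square image (assumed layout)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token.
    return patches_per_side ** 2 + int(has_class_token)


# ViT-L/14@336px: 336 / 14 = 24 patches per side.
num_patch_tokens = vit_sequence_length(336, 14, has_class_token=False)
num_tokens = vit_sequence_length(336, 14)
print(num_patch_tokens, num_tokens)  # 576 577
```

In an MLLM such as LLaVA-NeXT, the patch tokens are what the vision tower passes to the language model, so this count gives a rough sense of the per-image context cost.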
Data
Our model was trained on publicly available image-caption data from the LAION400M and COYO700M datasets.
Performance and Limitations
A. MLLMs Evaluation Results
To evaluate MLCD inside a Multimodal Large Language Model (MLLM), we replaced the CLIP vision tower in LLaVA-NeXT with the MLCD model and used Qwen2.5-7B as the language model in both configurations. MLCD scores higher than CLIP on 15 of the 17 benchmarks below; CLIP is slightly ahead on POPE and TextVQA_val. These results support the effectiveness of MLCD as a vision tower within MLLMs.
Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
---|---|---|
LLM | Qwen2.5-7B | Qwen2.5-7B |
AI2D | 76.98 | 73.15 |
ScienceQA_img | 78.09 | 76.35 |
GQA | 64.17 | 63.31 |
InfoVQA_val | 43.48 | 38.88 |
MMBench_cn_dev | 74.83 | 72.51 |
MMBench_en_dev | 76.37 | 74.57 |
MME(cognition) | 432 | 384 |
MME(perception) | 1598 | 1512 |
SeedBench | 68.20 | 66.80 |
SeedBench_img | 73.75 | 72.72 |
MMStar | 50.98 | 48.98 |
MMMU | 44.30 | 44.20 |
OCRBench | 531.00 | 525.00 |
ChartQA | 67.84 | 66.52 |
DocVQA_val | 76.46 | 75.21 |
POPE | 88.69 | 88.83 |
TextVQA_val | 61.69 | 62.47 |
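The comparison in the table above can be tallied with a short script. This is only a convenience sketch: the scores are copied from the table, and the relative-gain figure is our own way of putting benchmarks with different score scales (for example MME and OCRBench) on a common footing, not a metric reported by the evaluation itself.

```python
# Scores copied from the table above as (MLCD, CLIP) pairs.
scores = {
    "AI2D": (76.98, 73.15),
    "ScienceQA_img": (78.09, 76.35),
    "GQA": (64.17, 63.31),
    "InfoVQA_val": (43.48, 38.88),
    "MMBench_cn_dev": (74.83, 72.51),
    "MMBench_en_dev": (76.37, 74.57),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
    "SeedBench": (68.20, 66.80),
    "SeedBench_img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "OCRBench": (531.00, 525.00),
    "ChartQA": (67.84, 66.52),
    "DocVQA_val": (76.46, 75.21),
    "POPE": (88.69, 88.83),
    "TextVQA_val": (61.69, 62.47),
}

mlcd_ahead = [name for name, (mlcd, clip) in scores.items() if mlcd > clip]
clip_ahead = [name for name, (mlcd, clip) in scores.items() if clip > mlcd]

# Relative gain of MLCD over CLIP, so that differently scaled scores are comparable.
relative_gain = {name: (mlcd - clip) / clip for name, (mlcd, clip) in scores.items()}
largest_gain = max(relative_gain, key=relative_gain.get)

print(f"MLCD ahead on {len(mlcd_ahead)} of {len(scores)} benchmarks")
print(f"CLIP ahead on: {', '.join(clip_ahead)}")
print(f"Largest relative gain: {largest_gain} ({relative_gain[largest_gain]:.1%})")
```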
B. Linear Probe Evaluation Results
This table presents the results of linear probe evaluations comparing CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top to assess how well the model's representations generalize to different tasks.
Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
---|---|---|
AVG | 87.15 | 85.35 |
Food101 | 96.21 | 95.90 |
CIFAR-10 | 99.36 | 97.90 |
CIFAR-100 | 93.69 | 87.40 |
Birdsnap | 88.18 | 79.90 |
SUN397 | 87.96 | 82.20 |
Stanford Cars | 95.16 | 91.50 |
FGVC Aircraft | 86.38 | 71.60 |
Describable Textures Dataset | 86.70 | 83.00 |
Oxford-IIIT Pets | 96.27 | 95.10 |
Caltech-101 | 97.92 | 96.00 |
Flowers102 | 99.58 | 99.20 |
MNIST | 98.67 | 99.20 |
STL-10 | 99.28 | 99.70 |
EuroSAT | 99.06 | 98.10 |
RESISC45 | 95.48 | 94.90 |
GTSRB | 92.32 | 92.40 |
KITTI | 75.39 | 69.20 |
Country211 | 38.12 | 46.40 |
PatchCamelyon | 88.00 | 85.60 |
UCF101 | 92.86 | 92.00 |
Kinetics-700 | 73.35 | 73.00 |
CLEVR | 64.40 | 60.30 |
Hateful Memes | 72.00 | 77.30 |
SST-2 | 76.33 | 80.50 |
ImageNet | 86.30 | 85.40 |
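The linear probe protocol described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the evaluation code behind the table: `frozen_encoder` is a hypothetical stand-in for the pre-trained vision model, the data is synthetic, and the classifier is a binary logistic regression trained with plain gradient descent. The point it shows is that features are computed once by a fixed encoder and only the linear layer is trained.

```python
import math
import random


def frozen_encoder(x):
    """Hypothetical stand-in for a frozen backbone: a fixed, non-trainable map."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic regression (weights and bias) on fixed features."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        # Only the probe parameters are updated; the encoder is never touched.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    hits = 0
    for f, y in zip(features, labels):
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)


def make_data(rng, n):
    """Synthetic two-class data: the label is 1 when x0 + x1 > 0."""
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    ys = [1 if x[0] + x[1] > 0 else 0 for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 100)

# Features are extracted once with the frozen encoder, then reused for training.
train_f = [frozen_encoder(x) for x in train_x]
test_f = [frozen_encoder(x) for x in test_x]

w, b = train_linear_probe(train_f, train_y)
probe_accuracy = accuracy(w, b, test_f, test_y)
print(f"linear probe accuracy: {probe_accuracy:.2f}")
```

In a real evaluation, the encoder would be the pre-trained vision model run without gradient updates, and the probe would typically be a multi-class linear classifier fitted per dataset.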
C. Limitations
Higher-resolution models generally perform better on OCR-oriented tasks, so the current 336px input resolution is a limitation there. We are training higher-resolution models and will make them available soon.
Acknowledgments
We would like to express our gratitude to Xie Yin and Yumeng Wang for their significant contributions to the experimental validation in MLLMs.