---
license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Model

We use the same Vision Transformer architecture as [CLIP ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336). A minimal feature-extraction sketch is provided in the usage section further below.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)

## Data

Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.

## Performance and Limitations

### A. MLLMs Evaluation Results

In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to measure how MLCD performs inside Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The results show that the MLCD-based model outperforms the CLIP baseline on most benchmarks, validating the effectiveness of MLCD within MLLMs.

| Vision Tower     | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------|:----------------------|:----------------------|
| LLM              | Qwen2.5-7B             | Qwen2.5-7B             |
| AI2D             | 76.98                  | 73.15                  |
| ScienceQA_img    | 78.09                  | 76.35                  |
| GQA              | 64.17                  | 63.31                  |
| InfoVQA_val      | 43.48                  | 38.88                  |
| MMBench_cn_dev   | 74.83                  | 72.51                  |
| MMBench_en_dev   | 76.37                  | 74.57                  |
| MME(cognition)   | 432                    | 384                    |
| MME(perception)  | 1598                   | 1512                   |
| SeedBench        | 68.20                  | 66.80                  |
| SeedBench_img    | 73.75                  | 72.72                  |
| MMStar           | 50.98                  | 48.98                  |
| MMMU             | 44.30                  | 44.20                  |
| OCRBench         | 531.00                 | 525.00                 |
| ChartQA          | 67.84                  | 66.52                  |
| DocVQA_val       | 76.46                  | 75.21                  |
| POPE             | 88.69                  | 88.83                  |
| TextVQA_val      | 61.69                  | 62.47                  |

### B. Linear Probe Evaluation Results

This table presents the results of linear probe evaluations comparing the CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top of the frozen features, assessing how well the model's representations generalize to different tasks. An illustrative sketch of this protocol is included further below.

| Dataset                      | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------------------|:----------------------|:----------------------|
| AVG                          | 87.15                  | 85.35                  |
| Food101                      | 96.21                  | 95.90                  |
| CIFAR-10                     | 99.36                  | 97.90                  |
| CIFAR-100                    | 93.69                  | 87.40                  |
| Birdsnap                     | 88.18                  | 79.90                  |
| SUN397                       | 87.96                  | 82.20                  |
| Stanford Cars                | 95.16                  | 91.50                  |
| FGVC Aircraft                | 86.38                  | 71.60                  |
| Describable Textures Dataset | 86.70                  | 83.00                  |
| Oxford-IIIT Pets             | 96.27                  | 95.10                  |
| Caltech-101                  | 97.92                  | 96.00                  |
| Flowers102                   | 99.58                  | 99.20                  |
| MNIST                        | 98.67                  | 99.20                  |
| STL-10                       | 99.28                  | 99.70                  |
| EuroSAT                      | 99.06                  | 98.10                  |
| RESISC45                     | 95.48                  | 94.90                  |
| GTSRB                        | 92.32                  | 92.40                  |
| KITTI                        | 75.39                  | 69.20                  |
| Country211                   | 38.12                  | 46.40                  |
| PatchCamelyon                | 88.00                  | 85.60                  |
| UCF101                       | 92.86                  | 92.00                  |
| Kinetics-700                 | 73.35                  | 73.00                  |
| CLEVR                        | 64.40                  | 60.30                  |
| Hateful Memes                | 72.00                  | 77.30                  |
| SST-2                        | 76.33                  | 80.50                  |
| ImageNet                     | 86.30                  | 85.40                  |

### C. Limitations

Higher-resolution models handle OCR-related tasks better. We are currently training such models and will make them available soon.
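
## Usage Sketch

The snippet below is a minimal feature-extraction sketch, not an official usage guide: the repository id is a placeholder, and a CLIP-style vision interface in `transformers` is assumed only because the architecture matches CLIP's ViT-L/14@336px. Adjust the model class and checkpoint id to whatever this repository actually ships.

```python
# Minimal feature-extraction sketch. MODEL_ID is a placeholder: substitute the
# actual MLCD checkpoint name. A CLIP-style vision interface is assumed here
# because the architecture matches CLIP's ViT-L/14@336px.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_ID = "path/to/mlcd-vit-large-patch14-336"  # placeholder checkpoint id

processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
model = CLIPVisionModel.from_pretrained(MODEL_ID).eval()

image = Image.open("example.jpg")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

patch_tokens = outputs.last_hidden_state  # (1, 577, 1024): CLS + 24x24 patch tokens
image_feature = outputs.pooler_output     # (1, 1024): pooled global image feature
print(patch_tokens.shape, image_feature.shape)
```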
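
## Linear Probe Sketch

For reference, here is a hedged sketch of the linear probe protocol described above: freeze the vision tower, embed every image once, and train a linear classifier (here scikit-learn logistic regression) on the frozen features. The checkpoint id, the `load_split` helper, and the regularization value are illustrative assumptions and are not part of the original evaluation code.

```python
# Hedged linear-probe sketch: frozen backbone, linear classifier on pooled features.
# MODEL_ID is a placeholder; `load_split` is a hypothetical helper returning
# (list_of_PIL_images, list_of_int_labels) for any classification dataset.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_ID = "path/to/mlcd-vit-large-patch14-336"  # placeholder checkpoint id

processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
model = CLIPVisionModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(images, batch_size=32):
    # Extract one pooled feature vector per image from the frozen vision tower.
    feats = []
    for i in range(0, len(images), batch_size):
        inputs = processor(images=images[i:i + batch_size], return_tensors="pt")
        feats.append(model(**inputs).pooler_output)
    return torch.cat(feats).numpy()

train_images, train_labels = load_split("train")  # hypothetical data loader
test_images, test_labels = load_split("test")

clf = LogisticRegression(max_iter=1000, C=3.16)  # C would normally be swept on a validation split
clf.fit(embed(train_images), train_labels)
print("linear-probe accuracy:", clf.score(embed(test_images), test_labels))
```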

## Acknowledgments

We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs.