---
license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
## Model

We use the same Vision Transformer architecture as CLIP's [ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)
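A minimal sketch of using the encoder as a feature extractor with Hugging Face Transformers is shown below. The repository id is a placeholder, and the exact loading path (Auto classes, possibly with `trust_remote_code=True` if the checkpoint ships a custom vision class) depends on how the weights are packaged, so check the files in this repository.

```python
# Feature-extraction sketch. Assumptions: placeholder repo id and a CLIP-style
# packaging so that AutoImageProcessor / AutoModel resolve to the right classes.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "your-org/mlcd-vit-large-patch14-336"  # placeholder: use the id of this repo

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features an MLLM connector would consume:
# (1, 577, 1024) for ViT-L/14 at 336px = 24x24 patches + CLS token.
print(outputs.last_hidden_state.shape)
```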
## Data

Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.

## Performance and Limitations
### A. MLLMs Evaluation Results

To evaluate MLCD within Multimodal Large Language Models (MLLMs), we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with MLCD and used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. The better score in each row is shown in red; a short sketch of why the swap is drop-in follows the table.

| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:----------------|:----------------------|:----------------------|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | <span style="color:red">76.98</span> | 73.15 |
| ScienceQA_img | <span style="color:red">78.09</span> | 76.35 |
| GQA | <span style="color:red">64.17</span> | 63.31 |
| InfoVQA_val | <span style="color:red">43.48</span> | 38.88 |
| MMBench_cn_dev | <span style="color:red">74.83</span> | 72.51 |
| MMBench_en_dev | <span style="color:red">76.37</span> | 74.57 |
| MME(cognition) | <span style="color:red">432</span> | 384 |
| MME(perception) | <span style="color:red">1598</span> | 1512 |
| SeedBench | <span style="color:red">68.20</span> | 66.80 |
| SeedBench_img | <span style="color:red">73.75</span> | 72.72 |
| MMStar | <span style="color:red">50.98</span> | 48.98 |
| MMMU | <span style="color:red">44.30</span> | 44.20 |
| OCRBench | <span style="color:red">531.00</span> | 525.00 |
| ChartQA | <span style="color:red">67.84</span> | 66.52 |
| DocVQA_val | <span style="color:red">76.46</span> | 75.21 |
| POPE | 88.69 | <span style="color:red">88.83</span> |
| TextVQA_val | 61.69 | <span style="color:red">62.47</span> |

MLCD leads on 15 of the 17 benchmarks, with CLIP retaining a slight edge only on POPE and TextVQA.
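Because MLCD keeps CLIP's ViT-L/14@336px architecture, it produces vision features with the same shape as the original CLIP tower, which is what makes the swap inside LLaVA-NeXT a drop-in replacement (the multimodal projector and training recipe stay unchanged). The sketch below illustrates that compatibility check; the MLCD repository id is a placeholder, and whether the checkpoint loads through `CLIPVisionModel` or needs its own class depends on how it is packaged.

```python
# Shape-compatibility check between the CLIP vision tower and an MLCD tower.
# Assumptions: placeholder MLCD repo id; both checkpoints load with CLIPVisionModel.
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, CLIPVisionModel

towers = {
    "clip": "openai/clip-vit-large-patch14-336",
    "mlcd": "your-org/mlcd-vit-large-patch14-336",  # placeholder
}

# Random 336x336 RGB image; only the output shapes matter here.
dummy = Image.fromarray(np.random.randint(0, 256, (336, 336, 3), dtype=np.uint8))

for name, repo_id in towers.items():
    processor = AutoImageProcessor.from_pretrained(repo_id)
    tower = CLIPVisionModel.from_pretrained(repo_id).eval()
    inputs = processor(images=dummy, return_tensors="pt")
    with torch.no_grad():
        feats = tower(**inputs).last_hidden_state
    # Both towers should report (1, 577, 1024): 24x24 patches + CLS, hidden size 1024.
    print(name, tuple(feats.shape))
```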
### B. Linear Probe Evaluation Results

This table compares CLIP and MLCD (both ViT_L_14_336px) under linear probe evaluation across a range of datasets. The linear probe protocol freezes the pre-trained model's weights and trains a linear classifier on top of its features, measuring how well the frozen representations transfer to different tasks; a minimal sketch of this protocol follows the table. The better score in each row is shown in red.

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---------------|:----------------------|:----------------------|
| AVG | <span style="color:red">87.15</span> | 85.35 |
| Food101 | <span style="color:red">96.21</span> | 95.90 |
| CIFAR-10 | <span style="color:red">99.36</span> | 97.90 |
| CIFAR-100 | <span style="color:red">93.69</span> | 87.40 |
| Birdsnap | <span style="color:red">88.18</span> | 79.90 |
| SUN397 | <span style="color:red">87.96</span> | 82.20 |
| Stanford Cars | <span style="color:red">95.16</span> | 91.50 |
| FGVC Aircraft | <span style="color:red">86.38</span> | 71.60 |
| Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00 |
| Oxford-IIIT Pets | <span style="color:red">96.27</span> | 95.10 |
| Caltech-101 | <span style="color:red">97.92</span> | 96.00 |
| Flowers102 | <span style="color:red">99.58</span> | 99.20 |
| MNIST | 98.67 | <span style="color:red">99.20</span> |
| STL-10 | 99.28 | <span style="color:red">99.70</span> |
| EuroSAT | <span style="color:red">99.06</span> | 98.10 |
| RESISC45 | <span style="color:red">95.48</span> | 94.90 |
| GTSRB | 92.32 | <span style="color:red">92.40</span> |
| KITTI | <span style="color:red">75.39</span> | 69.20 |
| Country211 | 38.12 | <span style="color:red">46.40</span> |
| PatchCamelyon | <span style="color:red">88.00</span> | 85.60 |
| UCF101 | <span style="color:red">92.86</span> | 92.00 |
| Kinetics-700 | <span style="color:red">73.35</span> | 73.00 |
| CLEVR | <span style="color:red">64.40</span> | 60.30 |
| Hateful Memes | 72.00 | <span style="color:red">77.30</span> |
| SST-2 | 76.33 | <span style="color:red">80.50</span> |
| ImageNet | <span style="color:red">86.30</span> | 85.40 |

MLCD improves the overall average (87.15 vs. 85.35) and leads on most datasets; CLIP remains stronger on MNIST, STL-10, GTSRB, Country211, Hateful Memes, and SST-2.
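In outline, the protocol behind these numbers works as follows: extract features once from the frozen encoder, then fit a logistic-regression classifier on top. The sketch below is illustrative only: the repository id is a placeholder, CIFAR-10 stands in for the datasets above, and the regularisation constant is not the value used for the table.

```python
# Linear probe sketch: freeze the encoder, cache pooled features, fit a linear head.
# Assumptions: placeholder repo id, CLIP-compatible checkpoint, CIFAR-10 as the example task.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import AutoImageProcessor, CLIPVisionModel

model_id = "your-org/mlcd-vit-large-patch14-336"  # placeholder
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained(model_id)
encoder = CLIPVisionModel.from_pretrained(model_id).to(device).eval()

def extract_features(dataset, batch_size=64):
    feats, labels = [], []
    for start in range(0, len(dataset), batch_size):
        batch = [dataset[i] for i in range(start, min(start + batch_size, len(dataset)))]
        images = [image for image, _ in batch]
        labels.extend(label for _, label in batch)
        inputs = processor(images=images, return_tensors="pt").to(device)
        with torch.no_grad():
            out = encoder(**inputs)
        feats.append(out.pooler_output.cpu().numpy())  # pooled features from the frozen encoder
    return np.concatenate(feats), np.array(labels)

train_x, train_y = extract_features(CIFAR10("data", train=True, download=True))
test_x, test_y = extract_features(CIFAR10("data", train=False, download=True))

# The linear head is the only trained component; C would normally be tuned on a validation split.
clf = LogisticRegression(max_iter=1000, C=3.16).fit(train_x, train_y)
print("linear probe accuracy:", clf.score(test_x, test_y))
```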
### C. Limitations

Models with higher input resolution are generally better suited to OCR-related tasks. We are currently training higher-resolution variants and will release them soon.
## Acknowledgments

We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs.