Abstract

Last year, multimodal architectures drove a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model built on a pretrained LLM with adapters for the visual modality. We evaluated and compared several architectural design choices for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and approaches to fusing them, image encoding methods (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on eight visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves top scores on different VQA tasks compared with open-source LLaVA-like solutions. We also present a variety of scenarios in which OmniFusion provides highly detailed answers across domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training, and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
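
To illustrate the adapter idea described in the abstract, here is a minimal, hypothetical PyTorch sketch of an MLP adapter that projects visual-encoder tokens into the LLM embedding space before they are concatenated with text token embeddings. The class name, dimensions, and token counts below are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch: features from a frozen CLIP-ViT-style encoder are
# projected by a small MLP into the LLM embedding space and prepended to the
# text token embeddings. Names and sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Two-layer MLP mapping visual-encoder features to the LLM embedding size."""
    def __init__(self, d_visual: int, d_llm: int, d_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_visual, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_llm),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, d_visual) from the image encoder
        return self.proj(visual_tokens)

# Example: fuse 576 visual tokens (d=1024) with 32 text embeddings (d=4096)
adapter = VisualAdapter(d_visual=1024, d_llm=4096)
visual_tokens = torch.randn(1, 576, 1024)   # stand-in for CLIP ViT output
text_embeddings = torch.randn(1, 32, 4096)  # stand-in for LLM token embeddings
fused = torch.cat([adapter(visual_tokens), text_embeddings], dim=1)
print(fused.shape)  # torch.Size([1, 608, 4096])
```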

Community

Great job

Paper author

Thank you!

Well done, and thanks for sharing this!

Paper author

Thanks! If any questions, do not hesitate to ask

@kuznetsoffandrey, there is a question about preliminary experiments at small scale that I believe were not covered in this paper. One way of arriving at a specific model architecture setup is to run ablations and data-mixture experiments with small-scale models: 600M/1B/3B (https://arxiv.org/pdf/2403.09611.pdf).
For the OmniFusion architecture recipe, did you perform preliminary experiments with even smaller LLMs besides the one mentioned in the paper (Mistral-7B)?

Paper author

This is a very good point. We have not tried it yet; however, we have a 3B proprietary model. I think we will share some experiments with smaller models soon.

Well, very awesome!

OmniFusion: Revolutionizing Multimodal AI with Text and Image Integration

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

