Abstract
Last year, multimodal architectures drove a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose the OmniFusion model, which couples a pretrained LLM with adapters for the visual modality. We evaluated and compared several architectural design choices for better coupling of text and visual data: MLP and transformer adapters, various CLIP-ViT-based encoders (SigLIP, InternViT, etc.) and ways of fusing them, image encoding methods (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on eight visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves top scores on various VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of scenarios in which OmniFusion provides highly detailed answers across domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training, and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
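To make the adapter-based recipe described above concrete, here is a minimal, illustrative PyTorch sketch, not the official OmniFusion code (which is available in the linked repository). The module name `MLPAdapter`, the dimensions, and the simple concatenation of projected image tokens with text embeddings are assumptions for illustration of the general LLaVA-like setup: a ViT-style encoder produces visual tokens, a small MLP adapter projects them into the LLM embedding space, and the result is fed to the LLM together with the text tokens.

```python
# Illustrative sketch only; see https://github.com/AIRI-Institute/OmniFusion
# for the actual training and inference code.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Two-layer MLP mapping vision features to the LLM hidden size (hypothetical dims)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, vision_dim) from a ViT encoder
        return self.proj(visual_tokens)


# Toy usage with random tensors standing in for real encoder / LLM outputs.
adapter = MLPAdapter()
visual_tokens = torch.randn(1, 256, 1024)   # e.g. CLIP-ViT patch embeddings
text_embeds = torch.randn(1, 32, 4096)      # token embeddings from the LLM
multimodal_input = torch.cat([adapter(visual_tokens), text_embeds], dim=1)
print(multimodal_input.shape)               # torch.Size([1, 288, 4096])
```

A transformer adapter follows the same pattern but replaces the MLP with a small attention block; the paper compares both options.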
Community
Thanks! If you have any questions, do not hesitate to ask.
@kuznetsoffandrey
, there is a question on preliminary experiments at a small scale that I believe was not covered in this paper. One way of arriving at a specific model architecture setup is to run ablations and data-mixture experiments with small-scale models: 600M/1B/3B (https://arxiv.org/pdf/2403.09611.pdf).
In the case of the OmniFusion architecture recipe, have you performed preliminary experiments with even smaller-scale LLMs besides the one mentioned in the paper (Mistral-7B)?
This is a very good point. We have not tried it yet; however, we have a proprietary 3B model. I think we will share some experiments with smaller models soon.
You are very awesome!