MoAI: Mixture of All Intelligence for Large Language and Vision Models
Abstract
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
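The verbalization step mentioned above can be pictured with a minimal sketch: structured outputs from external detection, SGG, and OCR models are rendered as plain text before being passed to the MoAI-Compressor. The field names and sentence templates below are illustrative assumptions, not the paper's actual prompts.

```python
# Hedged sketch: turn structured CV outputs into an auxiliary text string.
# Templates and dictionary keys are assumptions for illustration only.

def verbalize_cv_outputs(detections, scene_graph, ocr_results):
    """Render detection, scene-graph, and OCR outputs as one text string."""
    lines = []
    # Detection / segmentation: object classes with normalized box coordinates.
    for obj in detections:
        x1, y1, x2, y2 = obj["box"]
        lines.append(f"There is a {obj['label']} at [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}].")
    # Scene graph generation: subject-relation-object triplets.
    for subj, rel, obj in scene_graph:
        lines.append(f"The {subj} is {rel} the {obj}.")
    # OCR: recognized text snippets.
    for text in ocr_results:
        lines.append(f'The text "{text}" is written in the image.')
    return " ".join(lines)

aux_text = verbalize_cv_outputs(
    detections=[{"label": "dog", "box": (0.10, 0.40, 0.55, 0.90)}],
    scene_graph=[("dog", "sitting on", "sofa")],
    ocr_results=["SALE 50%"],
)
print(aux_text)
```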
Community
Interesting architectural choices:
Have you tried feeding the auxiliary features into the LLM without compression? Is the compression of all auxiliary features down to just 64 tokens mainly for efficiency?
Have you tried just concatenating all of the features together (marked by begin/end boundary tokens, for example) instead of using cross-attention in each of the "experts"?
Thanks for your interest in our work and for the insightful questions!
From the paper "Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study", we gained the insight that feeding auxiliary information in directly, without compression, leads to performance degradation. We believe this is because incorrect information from the external computer vision models can directly damage MoAI's outputs. As you know, external CV models, like any other models, are not perfect predictors, for many reasons. That is why the referenced paper assigns learnable parameters to the auxiliary information. The main purpose of compression is efficiency, of course, but beyond that, we expect the MoAI-Compressor to correct wrong information and to eliminate information that is not relevant to vision language tasks.
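One common way to realize this kind of compression (a minimal sketch, assuming a Perceiver-style resampler rather than the released MoAI code) is to let a fixed set of 64 learnable query tokens cross-attend over the much longer sequence of verbalized auxiliary tokens, so that only relevant information survives. Module names and dimensions below are illustrative assumptions.

```python
# Hedged sketch of a compressor: 64 learnable queries cross-attend over
# ~1,500 auxiliary tokens. Not the authors' exact implementation.
import torch
import torch.nn as nn

class MoAICompressorSketch(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8, depth=2):
        super().__init__()
        # Fixed-size set of learnable queries: the 64 output tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, aux_tokens):                        # (B, ~1500, dim)
        b = aux_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, 64, dim)
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(q, aux_tokens, aux_tokens)      # cross-attention
            q = norm(q + out)                             # residual + norm
        return q                                          # (B, 64, dim)

compressor = MoAICompressorSketch()
compressed = compressor(torch.randn(2, 1500, 1024))
print(compressed.shape)  # torch.Size([2, 64, 1024])
```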
We can answer this question in two parts. Without the compressed 64 tokens, the auxiliary tokens number about 1,500 on average, so training MoAI on image tokens + auxiliary tokens + language tokens imposes a heavy burden, and wrong information can damage MoAI's output. With the compressed tokens, simply concatenating all the features as you describe would be an appropriate infusion strategy only if the transformer decoder blocks themselves were trained, with supervised fine-tuning, LoRA, or QLoRA. However, we want to connect the original transformer decoder blocks with MoAI-Mixer and train only MoAI-Mixer, instead of tuning the transformer decoder directly, because we believe this plug-and-play infusion strategy boosts the utilization of LLMs without editing the original models. Furthermore, our use of MoE is motivated by its effectiveness in "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models", which provided a key insight into how to effectively harmonize auxiliary features with visual and language features: the self-attended and cross-attended features are expected to capture different aspects independently, compared with jointly concatenated features.
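To make the mixing idea concrete, here is a minimal sketch of the Mixture-of-Experts flavor described above: each expert is a self- or cross-attention module over the visual, auxiliary, or language features, and a small gating network weights their outputs before a residual connection back into the frozen decoder's hidden states. The number of experts, gating scheme, and layer sizes are assumptions, not the paper's exact design.

```python
# Hedged sketch of an attention-expert mixer with a per-token gate.
# Illustrative reconstruction only; not the released MoAI-Mixer.
import torch
import torch.nn as nn

class MoAIMixerSketch(nn.Module):
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        # Experts: self-attention plus cross-attention to the other modalities.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_aux = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gating network: per-token softmax weights over the three experts.
        self.gate = nn.Linear(dim, 3)

    def forward(self, hidden, visual_tokens, aux_tokens):
        experts = torch.stack([
            self.self_attn(hidden, hidden, hidden)[0],
            self.cross_visual(hidden, visual_tokens, visual_tokens)[0],
            self.cross_aux(hidden, aux_tokens, aux_tokens)[0],
        ], dim=-1)                                   # (B, T, dim, 3)
        weights = self.gate(hidden).softmax(dim=-1)  # (B, T, 3)
        mixed = (experts * weights.unsqueeze(2)).sum(dim=-1)
        # Residual: the frozen decoder block receives hidden + mixed,
        # so only the mixer's parameters need to be trained.
        return hidden + mixed

mixer = MoAIMixerSketch()
out = mixer(torch.randn(2, 32, 1024), torch.randn(2, 576, 1024), torch.randn(2, 64, 1024))
print(out.shape)  # torch.Size([2, 32, 1024])
```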
Your helpful questions have strengthened the contributions of the current paper and will inform our future work.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- CoLLaVO: Crayon Large Language and Vision mOdel (2024)
- Question Aware Vision Transformer for Multimodal Reasoning (2024)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception (2024)
- Supervised Fine-tuning in turn Improves Visual Foundation Models (2024)
- Efficient Multimodal Learning from Data-centric Perspective (2024)