arxiv:2407.11895

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Published on Jul 16

Abstract

Recently, human-computer interaction with various modalities has shown promising applications, such as GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we dynamically assign weights to different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing spaces both require only lightweight networks, OmniBind is extremely training-efficient. Learning the largest 30B model requires merely unpaired unimodal data and approximately 3 days on a single node with eight 4090 GPUs. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding.
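To make the idea of "dynamically assigning weights to different spaces" more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `route_and_bind`, the linear router, and the weighted-sum combination rule are all illustrative assumptions, and the per-space embeddings are assumed to have already been remapped into a shared dimension. It only shows the general pattern of a router producing per-input softmax weights over several embedding spaces.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_bind(space_embeddings, router_weights):
    """Combine K per-space embeddings of one input into a single vector.

    space_embeddings: K vectors of equal dimension D (assumed already
        remapped into a shared space).
    router_weights: K rows of length K*D, a hypothetical linear router
        (an assumption for illustration, not the paper's architecture).
    Returns (combined_unit_vector, per_space_weights).
    """
    num_spaces = len(space_embeddings)
    dim = len(space_embeddings[0])
    # Router input: concatenation of all per-space embeddings.
    features = [x for emb in space_embeddings for x in emb]
    # One score per space, turned into weights that sum to 1.
    scores = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    weights = softmax(scores)
    # Assumed combination rule: weighted sum of the per-space embeddings.
    combined = [
        sum(weights[k] * space_embeddings[k][d] for k in range(num_spaces))
        for d in range(dim)
    ]
    return l2_normalize(combined), weights


# Toy usage with random stand-ins for three specialist spaces of dimension 4.
random.seed(0)
demo_spaces = [l2_normalize([random.gauss(0, 1) for _ in range(4)]) for _ in range(3)]
demo_router = [[random.gauss(0, 0.1) for _ in range(12)] for _ in range(3)]
demo_combined, demo_weights = route_and_bind(demo_spaces, demo_router)
```

In this sketch the router is untrained; in the setting the abstract describes, the router parameters would be learned with the two stated objectives (cross-modal overall alignment and language representation decoupling), which are not modeled here.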

Community

ZehanWang

Paper submitter Jul 17

Homepage is https://omnibind.github.io/

nielsr

Jul 18

Hi @ZehanWang congrats on this work!

Feel free to claim the paper by clicking on your name above the paper title.

I see you've released the models here: https://huggingface.co/Viglong/OmniBind. It would be great to link them to this paper and fill in a model card; see the following resources:

linking to paper: https://huggingface.co/docs/hub/en/model-cards#linking-a-paper
writing a model card: https://huggingface.co/docs/hub/en/model-cards.

Also, ideally each checkpoint should be in a separate repository rather than having them all in a single one. We usually recommend this guide for uploading models: https://huggingface.co/docs/hub/models-uploading.

Let me know if you need any help!

Cheers,
Niels
Open-source @ HF
