BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Abstract
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
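Because both the image encoder and the language model stay frozen, a trained BLIP-2 checkpoint can be queried like any image-conditioned language model. As a minimal sketch of the zero-shot captioning and instructed-generation capabilities described above, here is how the released Salesforce checkpoints on the Hub load through the transformers library (the example image URL is illustrative; any RGB image works):

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# BLIP-2 checkpoint: frozen ViT image encoder + Q-Former + frozen OPT-2.7B LM.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Example input image (a COCO validation photo).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Zero-shot captioning: no text prompt, the model generates a description.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Instructed zero-shot generation (e.g., VQA): prepend a natural-language prompt.
prompt = "Question: how many animals are in the photo? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```

Since only the Q-Former bridges the two frozen components, the same recipe applies to other frozen language models; the official release also includes FlanT5-based variants such as Salesforce/blip2-flan-t5-xl.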
Community
The following papers, similar to this one, were recommended by the Semantic Scholar API (via Librarian Bot):
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs (2024)
- Enhancing Vision-Language Model with Unmasked Token Alignment (2024)
- A Single Transformer for Scalable Vision-Language Modeling (2024)
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning (2024)
- Multi-Modal Generative Embedding Model (2024)
Models citing this paper: 42
Datasets citing this paper: 1
Spaces citing this paper: 221
Collections including this paper: 0