Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Abstract
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.
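The late interaction scoring mentioned in the abstract is commonly formulated as a sum of maximum similarities (MaxSim): each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation. The function names, the toy two-dimensional vectors, and the use of cosine similarity via L2 normalization are assumptions made for illustration.

```python
from math import sqrt


def normalize(vec):
    """Return an L2-normalized copy of vec (unchanged if it has zero norm)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction (MaxSim) score between two sets of token embeddings.

    For every query token vector, take the maximum cosine similarity over
    all document token vectors, then sum those maxima over the query tokens.
    Assumes doc_vecs is non-empty.
    """
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    score = 0.0
    for qv in q:
        # Best-matching document token for this query token.
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Toy embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # has an exact match for each query token
doc_b = [[-1.0, 0.0], [1.0, 1.0]]              # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because the document vectors do not depend on the query, they can be computed and indexed ahead of time, which is what keeps inference cost closer to a bi-encoder than to a cross-encoder.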
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval (2024)
- Mistral-SPLADE: LLMs for better Learned Sparse Retrieval (2024)
- Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment (2024)
- Mamba Retriever: Utilizing Mamba for Effective and Efficient Dense Retrieval (2024)
- NV-Retriever: Improving text embedding models with effective hard-negative mining (2024)