How to re-rank your snippets in RAG: ColBERT, Rerankers, Cross-Encoders
Let's say you're doing RAG, and in an effort to improve performance, you try to rerank a few candidate source snippets by their relevance to the query.
How can you score the similarity between your query and any source document?
1. Just use embeddings: No-interaction
This means you encode each token of both the query and the document as a separate vector, then average the token vectors of each side to get just 2 vectors in total, and finally compute their similarity, e.g. via cosine (see the sketch below).
➡️ Notable examples: check the top of the MTEB leaderboard!
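Here is a minimal sketch of no-interaction scoring, assuming the sentence-transformers library; the checkpoint name is just an illustrative choice, any bi-encoder from the MTEB leaderboard works the same way.

```python
# No-interaction (bi-encoder): each text is pooled into a single vector,
# independently of the other, then compared with one cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative checkpoint

query = "How do I re-rank snippets in RAG?"
docs = [
    "ColBERT keeps one vector per token and scores with MaxSim.",
    "The Eiffel Tower is located in Paris.",
]

# encode() pools the token embeddings of each text into one vector.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# A single cosine similarity per (query, doc) pair: very fast, and
# doc_embs can be computed once and stored in a vector index.
scores = util.cos_sim(query_emb, doc_embs)
print(scores)  # shape (1, num_docs)
```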
2. Late-interaction: This is ColBERT
As before, you encode each token of both the query and the document as a separate vector, but you compare all of them together, without first averaging them away and losing information.
This is more accurate than no-interaction, but also slower, because you have to compare n x m token vectors instead of just 2. On the upside, you can still pre-compute and store the document token embeddings, and ColBERT has some optimisations like token pooling to go faster (see the MaxSim sketch below).
➡️ Notable examples: ColBERTv2, mxbai-colbert-large-v1, jina-colbert-v2
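A minimal sketch of the late-interaction scoring rule (MaxSim), assuming you already have per-token embeddings; the shapes and names are illustrative:

```python
# Late interaction (ColBERT-style MaxSim): keep one vector per token,
# match every query token against its best doc token, then sum.
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """query_tokens: (n, d), doc_tokens: (m, d), both L2-normalized."""
    sim = query_tokens @ doc_tokens.T   # (n, m) cosine similarities
    return sim.max(dim=1).values.sum()  # best doc match per query token

n, m, d = 8, 100, 128  # toy sizes: n query tokens, m doc tokens, dim d
query = F.normalize(torch.randn(n, d), dim=-1)
docs = [F.normalize(torch.randn(m, d), dim=-1) for _ in range(3)]

# The doc token embeddings can be pre-computed offline; only the cheap
# (n, m) dot products happen at query time.
scores = [maxsim_score(query, doc_tokens) for doc_tokens in docs]
```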
3. Early interaction: Cross-encoders
Here you run the concatenated query + document through the model, which directly outputs a final relevance score.
This is the most accurate, but also the slowest, since you need one full forward pass per query-document pair, which takes really long when you have many docs to rerank! And you cannot pre-compute and store anything (see the sketch below).
➡️ Notable examples: the MixedBread or Jina AI rerankers!
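A minimal sketch of early-interaction reranking, again assuming the sentence-transformers library; the cross-encoder checkpoint is an illustrative choice:

```python
# Early interaction (cross-encoder): query and doc are concatenated and
# run through the model together, which outputs one relevance score.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative checkpoint

query = "How do I re-rank snippets in RAG?"
docs = [
    "ColBERT keeps one vector per token and scores with MaxSim.",
    "The Eiffel Tower is located in Paris.",
]

# One full forward pass per (query, doc) pair: accurate but slow,
# and nothing can be pre-computed offline.
scores = model.predict([(query, doc) for doc in docs])
reranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
print(reranked)
```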
So your choice is a trade-off between speed and accuracy: I think ColBERT is often a really good choice!
Based on this great post by Jina AI: https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter