Cosine similarity very high for any pair of sentences
I'm using this model for sentence embeddings, and I'm trying to check that similar sentences get high cosine similarity, as expected, and that very different sentences get low cosine similarity.
However, no matter how different I make my sentence/word pairs, I still get high cosine similarity (above 0.7).
Usually, cosine similarity for embeddings ranges from 0 to 1, and only very similar sentences score above 0.7.
Is there an explanation for this?
Examples:
sent 1 is: query: I just love fruit
sent 2 is: query: a totally different sentence that has nothing in common
Cosine Similarity: 0.75555754
sent 1 is: query: how are you feeling?
sent 2 is: query: I was born in 1933, germany
Cosine Similarity: 0.75990283
sent 1 is: query: The company announced record profits for the fiscal year.
sent 2 is: query: She carefully crafted a beautiful painting on the canvas.
Cosine Similarity: 0.74772125
sent 1 is: query: armadillo
sent 2 is: query: read
Cosine Similarity: 0.79260266
sent 1 is: query: ajkshdk
sent 2 is: query: here
Cosine Similarity: 0.8243466
sent 1 is: query: glass
sent 2 is: query: smelly
Cosine Similarity: 0.8391364
sent 1 is: query: beautiful
sent 2 is: query: wound
Cosine Similarity: 0.84729916
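For reference, here is a minimal sketch of how scores like these can be computed with sentence-transformers; the model name below is only a placeholder for whichever E5-style checkpoint is actually being used:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model name; swap in the checkpoint you are actually testing.
model = SentenceTransformer("intfloat/multilingual-e5-base")

sent1 = "query: I just love fruit"
sent2 = "query: a totally different sentence that has nothing in common"

# Encode both sentences and compute their cosine similarity.
emb1, emb2 = model.encode([sent1, sent2], normalize_embeddings=True)
print("Cosine Similarity:", util.cos_sim(emb1, emb2).item())
```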
Hi @GiliGold
What matters for similarity search / text ranking is the relative order of the scores rather than their absolute values. An embedding model is good as long as relevant text pairs receive higher scores than irrelevant text pairs, even if the score difference is small.
From a more technical perspective, the reason the scores are distributed around 0.7 to 1.0 is that we use a small temperature of 0.01 for the InfoNCE contrastive loss.
Please also refer to https://github.com/microsoft/unilm/issues/1216#issuecomment-1646842947
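For concreteness, here is a rough illustration of how a small temperature compresses the score range during training. This is only a sketch of an in-batch InfoNCE loss, not the actual training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.01):
    """Illustrative InfoNCE loss over in-batch negatives.

    query_emb, passage_emb: (batch, dim) L2-normalized embeddings, where
    passage_emb[i] is the positive passage for query_emb[i].
    """
    # Cosine similarities between every query and every passage in the batch.
    scores = query_emb @ passage_emb.T  # (batch, batch)
    # Dividing by a small temperature (e.g. 0.01) sharply amplifies score
    # differences, so training only needs positives to score slightly higher
    # than negatives; the absolute cosine values end up compressed near the top.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores / temperature, labels)
```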
I experienced the same behavior as @GiliGold. I want to use the embeddings as a "coarse filter" step to decide which documents I should process further and which are unrelated to a certain topic.
Because the cosine similarity values are so densely packed, it is hard to determine an appropriate threshold for this. Do you have any recommendation on how to perform this kind of semantic search?
@LKriesch
Even though the score distribution is dense, the ranking of the scores is still informative, so you can choose a threshold as usual. In fact, if you apply a simple linear transformation, new_score = 2 * (old_score - 0.85) / (1.0 - 0.7), the scores will be distributed roughly in (-1, 1).
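A quick check of that rescaling at the endpoints of the original range:

```python
def rescale(old_score):
    # Map scores that cluster in roughly [0.7, 1.0] onto roughly [-1, 1].
    return 2 * (old_score - 0.85) / (1.0 - 0.7)

print(rescale(0.70))  # -1.0
print(rescale(0.85))  #  0.0
print(rescale(1.00))  #  1.0
```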
In the information retrieval literature, most people use embeddings to efficiently retrieve the top-k documents, and then re-rank these documents with a more powerful but less efficient model. Maybe you can do the coarse filtering in a similar way instead of relying on the absolute values.
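A minimal retrieve-then-rerank sketch along those lines, assuming sentence-transformers models; both model names here are placeholders, not a recommendation of specific checkpoints:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Placeholder model names; any bi-encoder / cross-encoder pair works the same way.
retriever = SentenceTransformer("intfloat/multilingual-e5-base")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "query: how do I reset my password?"
docs = [
    "passage: To reset your password, open the account settings page.",
    "passage: Our office is closed on public holidays.",
    "passage: Contact support if you cannot log in.",
]

# Step 1: cheap coarse retrieval with the embedding model (top-k by cosine score).
q_emb = retriever.encode(query, normalize_embeddings=True)
d_emb = retriever.encode(docs, normalize_embeddings=True)
top_k = util.semantic_search(q_emb, d_emb, top_k=2)[0]

# Step 2: re-rank only the retrieved candidates with the slower cross-encoder.
candidates = [docs[hit["corpus_id"]] for hit in top_k]
rerank_scores = reranker.predict([(query, doc) for doc in candidates])
print(sorted(zip(rerank_scores, candidates), reverse=True))
```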
@intfloat I appreciate your answer, this has been bugging me too!
Could you give an example of a more powerful but less efficient model to use for RAG after document retrieval?
I'm wondering what the SOTA is for RAG, both for finding documents and then for searching within them.
For RAG after document retrieval, you can use popular open-source LLMs such as LLaMA.
As for the SOTA of RAG, different papers adopt quite different evaluation settings, so it is hard to say which method is better.