Update README.md
README.md CHANGED
@@ -6813,7 +6813,7 @@ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
 outputs = model(**batch_dict)
 embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
 
-#
+# normalize embeddings
 embeddings = F.normalize(embeddings, p=2, dim=1)
 scores = (embeddings[:2] @ embeddings[2:].T) * 100
 print(scores.tolist())
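For context, this hunk is the tail of the model card's `transformers` usage example. Below is a minimal self-contained sketch of the full pipeline as it appears in the card; the four `input_texts` are placeholders here, and the `average_pool` helper is the one defined earlier in the README:

```python
# Sketch of the surrounding model-card example; input_texts are placeholders.
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then mean-pool over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')

# Each input must carry a "query: " or "passage: " prefix (see the FAQ below).
input_texts = ['query: ...', 'query: ...', 'passage: ...', 'passage: ...']
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings, then score the 2 queries against the 2 passages
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```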
@@ -6865,11 +6865,61 @@ For all labeled datasets, we only use their training sets for fine-tuning.
 
 For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).
 
-## Benchmark
+## Benchmark Results on [Mr. TyDi](https://arxiv.org/abs/2108.08787)
+
+| Model                 | Avg MRR@10 |   | ar   | bn   | en   | fi   | id   | ja   | ko   | ru   | sw   | te   | th   |
+|-----------------------|------------|---|------|------|------|------|------|------|------|------|------|------|------|
+| BM25                  | 33.3       |   | 36.7 | 41.3 | 15.1 | 28.8 | 38.2 | 21.7 | 28.1 | 32.9 | 39.6 | 42.4 | 41.7 |
+| mDPR                  | 16.7       |   | 26.0 | 25.8 | 16.2 | 11.3 | 14.6 | 18.1 | 21.9 | 18.5 | 7.3  | 10.6 | 13.5 |
+| BM25 + mDPR           | 41.7       |   | 49.1 | 53.5 | 28.4 | 36.5 | 45.5 | 35.5 | 36.2 | 42.7 | 40.5 | 42.0 | 49.2 |
+|                       |            |   |      |      |      |      |      |      |      |      |      |      |      |
+| multilingual-e5-small | 64.4       |   | 71.5 | 66.3 | 54.5 | 57.7 | 63.2 | 55.4 | 54.3 | 60.8 | 65.4 | 89.1 | 70.1 |
+| multilingual-e5-base  | 65.9       |   | 72.3 | 65.0 | 58.5 | 60.8 | 64.9 | 56.6 | 55.8 | 62.7 | 69.0 | 86.6 | 72.7 |
+| multilingual-e5-large | **70.5**   |   | 77.5 | 73.2 | 60.8 | 66.8 | 68.5 | 62.5 | 61.6 | 65.8 | 72.7 | 90.2 | 76.2 |
+
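The "Avg MRR@10" column above is mean reciprocal rank with a cutoff of ten results, scaled by 100. As a quick illustration of the metric only, a hypothetical sketch (not code from the model card):

```python
# Hypothetical illustration of MRR@10: each query contributes 1/rank of its
# first relevant hit within the top 10 results, 0 if none is found; the table
# reports the mean over all queries, scaled by 100.
def mrr_at_10(rankings, relevant):
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids[:10], start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)

# First query: relevant doc at rank 2 -> 0.5; second query: rank 1 -> 1.0.
print(100 * mrr_at_10([['d3', 'd7'], ['d1', 'd9']], [{'d7'}, {'d1'}]))  # 75.0
```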
+## MTEB Benchmark Evaluation
 
 Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results
 on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).
 
+## Support for Sentence Transformers
+
+Below is an example of usage with sentence_transformers.
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer('intfloat/multilingual-e5-base')
+input_texts = [
+    'query: how much protein should a female eat',
+    'query: 南瓜的家常做法',
+    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
+    "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
+]
+embeddings = model.encode(input_texts, normalize_embeddings=True)
+```
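Since `normalize_embeddings=True` yields unit-length vectors, relevance scores reduce to plain dot products. A possible follow-up step, mirroring the transformers example above (not part of the card's snippet):

```python
# Hypothetical follow-up: dot products of unit vectors are cosine similarities.
scores = embeddings[:2] @ embeddings[2:].T  # 2 queries x 2 passages
print(scores.tolist())
```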
+
+Package requirements:
+
+`pip install sentence_transformers~=2.2.2`
+
+Contributors: [michaelfeil](https://huggingface.co/michaelfeil)
+
+## FAQ
+
+**1. Do I need to add the prefix "query: " and "passage: " to input texts?**
+
+Yes, this is how the model is trained; otherwise you will see a performance degradation.
+
+Here are some rules of thumb (a sketch follows this list):
+- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
+
+- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
+
+- Use the "query: " prefix if you want to use embeddings as features, such as linear probing classification and clustering.
+
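To make the rules concrete, a minimal hypothetical sketch; the texts and task pairings are invented for illustration:

```python
# Hypothetical illustration of the prefix rules above; texts are invented.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-base')

# Asymmetric retrieval: "query: " for questions, "passage: " for documents.
asymmetric = [
    'query: how tall is mount everest',
    'passage: Mount Everest rises 8,849 meters above sea level.',
]

# Symmetric similarity / paraphrase: "query: " on both sides.
symmetric = [
    'query: a man is playing a guitar',
    'query: someone strums an acoustic guitar',
]

emb = model.encode(asymmetric + symmetric, normalize_embeddings=True)
print(emb[0] @ emb[1])  # query-passage relevance
print(emb[2] @ emb[3])  # sentence-sentence similarity
```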
+**2. Why are my reproduced results slightly different from those reported in the model card?**
+
+Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.
+
 ## Citation
 
 If you find our paper or models helpful, please consider citing as follows:
@@ -6885,4 +6935,4 @@ If you find our paper or models helpful, please consider citing as follows:
 
 ## Limitations
 
-Long texts will be truncated to at most 512 tokens.
+Long texts will be truncated to at most 512 tokens.
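A note on the Limitations line: the 512-token cap is enforced by the tokenizer call in the usage example. A hypothetical quick check, assuming the `tokenizer` from the transformers sketch near the top:

```python
# Hypothetical check of the 512-token cap (tokenizer from the sketch above).
long_text = 'query: ' + 'protein ' * 2000  # far longer than 512 tokens
batch = tokenizer([long_text], max_length=512, padding=True, truncation=True, return_tensors='pt')
print(batch['input_ids'].shape)  # -> torch.Size([1, 512]); the tail is dropped
```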