Update README.md
README.md CHANGED
@@ -5,7 +5,53 @@ tags:
- feature-extraction
- sentence-similarity
- transformers
- information-retrieval
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
  sentences:
  - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
  - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---

<h1 align="center">MMLW-retrieval-e5-large</h1>

MMLW (muszę mieć lepszą wiadomość, "I must have better news") are neural text encoders for Polish.
This model is optimized for information retrieval tasks. It can transform queries and passages into 1024-dimensional vectors.
The model was developed using a two-step procedure:
- In the first step, it was initialized with a multilingual E5 checkpoint, and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-large-en) as teacher models for distillation.
- The second step involved fine-tuning the obtained models with a contrastive loss on the [Polish MS MARCO](https://huggingface.co/datasets/clarin-knext/msmarco-pl) training split. In order to improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs. An illustrative sketch of both training objectives follows after this list.
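
To make the two-step procedure above concrete, here is a minimal, illustrative PyTorch sketch of both objectives: an embedding-distillation loss over parallel Polish-English pairs, and an in-batch contrastive loss over query-passage pairs. The function names, temperature value, and exact loss formulations are assumptions for illustration only and are not taken from the actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pl: torch.Tensor,
                      student_en: torch.Tensor,
                      teacher_en: torch.Tensor) -> torch.Tensor:
    """Multilingual knowledge distillation in the spirit of Reimers & Gurevych (2020):
    the student's Polish and English embeddings of a parallel pair are both pulled
    towards the (frozen) teacher's embedding of the English sentence."""
    return F.mse_loss(student_pl, teacher_en) + F.mse_loss(student_en, teacher_en)

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              passage_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Temperature-scaled cross-entropy over in-batch negatives: row i of passage_emb
    is the positive for row i of query_emb, every other row acts as a negative.
    Larger batches give more negatives per query, which is why the large batch sizes
    mentioned above (e.g. 288 for the large model) matter."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    scores = q @ p.T / temperature          # (batch, batch) cosine similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)

# Toy call with random vectors standing in for encoder outputs (dim = 1024).
loss = in_batch_contrastive_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```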

## Usage (Sentence-Transformers)

⚠️ Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "** ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-e5-large")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```
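
If you prefer to call the underlying encoder directly with the `transformers` library, a rough sketch is shown below. It assumes mean pooling over token embeddings followed by cosine similarity, which is typical for E5-style encoders; verify the pooling against this model's sentence-transformers configuration before relying on it, and keep the same "query: " / "passage: " prefixes.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "sdadas/mmlw-retrieval-e5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average the token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

texts = [
    "query: Jak dożyć 100 lat?",
    "passage: Trzeba zdrowo się odżywiać i uprawiać sport.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)
embeddings = F.normalize(mean_pool(output.last_hidden_state, batch["attention_mask"]), dim=-1)
print(embeddings @ embeddings.T)  # cosine similarities between the two texts
```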

## Evaluation Results

The model achieves **NDCG@10** of **58.05** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
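
For reference, the snippet below illustrates how NDCG@10 is commonly computed for a single query: the discounted cumulative gain of the top 10 ranked passages, normalised by the gain of an ideal ordering. It is a generic illustration of the metric, not the PIRB evaluation code.

```python
import math

def ndcg_at_10(relevances: list[float]) -> float:
    """relevances: graded relevance of the returned passages in ranked order (best-ranked first)."""
    def dcg(rels: list[float]) -> float:
        return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: relevant passages retrieved at ranks 1 and 4 out of ten results.
print(ndcg_at_10([1, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
```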
|