sdadas committed
Commit 9e8d6d6
Parent: 31c9c76

Update README.md

Files changed (1)
  1. README.md +47 -1
README.md CHANGED
@@ -5,7 +5,53 @@ tags:
  - feature-extraction
  - sentence-similarity
  - transformers
+ - information-retrieval
+ language: pl
+ license: apache-2.0
+ widget:
+ - source_sentence: "query: Jak dożyć 100 lat?"
+   sentences:
+   - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
+   - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
+   - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."

  ---

- # mmlw-retrieval-e5-large
+ <h1 align="center">MMLW-retrieval-e5-large</h1>
+
+ MMLW (muszę mieć lepszą wiadomość, "I must have a better message") are neural text encoders for Polish.
+ This model is optimized for information retrieval tasks. It can transform queries and passages into 1024-dimensional vectors.
+ The model was developed using a two-step procedure:
+ - In the first step, it was initialized with a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-large-en) as teacher models for distillation.
+ - The second step involved fine-tuning the obtained models with a contrastive loss on the [Polish MS MARCO](https://huggingface.co/datasets/clarin-knext/msmarco-pl) training split. To improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs. A minimal sketch of both training objectives is shown after this list.
+
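+ The snippet below is a minimal, illustrative sketch of the two training objectives described above: an MSE-style distillation loss that pulls the student's embeddings towards the teacher's, and an in-batch contrastive (InfoNCE-style) loss over query-passage pairs. It uses random tensors in place of real encoder outputs, and the batch size and temperature are assumed values; it is not the actual training code.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ batch_size, dim = 4, 1024  # toy batch; real training used much larger batches
+
+ # Step 1: multilingual knowledge distillation.
+ # A frozen English teacher (e.g. BGE) encodes the English side of a translation
+ # pair; the student encodes the Polish side and is trained to reproduce the
+ # teacher's embedding, here with a simple MSE loss.
+ teacher_emb = torch.randn(batch_size, dim)                      # teacher(English text)
+ student_emb = torch.randn(batch_size, dim, requires_grad=True)  # student(Polish text)
+ distill_loss = F.mse_loss(student_emb, teacher_emb)
+
+ # Step 2: contrastive fine-tuning on query-passage pairs.
+ # The i-th query matches the i-th passage; every other passage in the batch
+ # acts as an in-batch negative, which is why large batch sizes help.
+ query_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
+ passage_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
+ temperature = 0.02                                  # assumed value
+ scores = query_emb @ passage_emb.T / temperature    # (batch, batch) similarity matrix
+ labels = torch.arange(batch_size)                   # index of the positive passage per query
+ contrastive_loss = F.cross_entropy(scores, labels)
+
+ print(distill_loss.item(), contrastive_loss.item())
+ ```
+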
+ ## Usage (Sentence-Transformers)
+
+ ⚠️ Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "** ⚠️
+
+ You can use the model like this with [sentence-transformers](https://www.SBERT.net):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+
+ # Prefixes required by this model
+ query_prefix = "query: "
+ answer_prefix = "passage: "
+ queries = [query_prefix + "Jak dożyć 100 lat?"]
+ answers = [
+     answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
+     answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
+     answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
+ ]
+ model = SentenceTransformer("sdadas/mmlw-retrieval-e5-large")
+ queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
+ answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
+
+ # Pick the passage most similar to the query (cosine similarity)
+ best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
+ print(answers[best_answer])
+ # Trzeba zdrowo się odżywiać i uprawiać sport. ("You need to eat healthily and do sports.")
+ ```
+
+ ## Evaluation Results
+
+ The model achieves **NDCG@10** of **58.05** on the Polish Information Retrieval Benchmark (PIRB). See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
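+
+ As background on the metric, NDCG@10 compares the model's ranking of the top 10 retrieved passages against the ideal ranking of the same relevance judgements. The snippet below is a self-contained illustration of that computation with made-up relevance labels (not PIRB data); the helper functions are hypothetical and only show the standard formula.
+
+ ```python
+ import math
+
+ def dcg_at_k(relevances, k=10):
+     """Discounted cumulative gain of a ranked list of relevance labels."""
+     return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))
+
+ def ndcg_at_k(relevances, k=10):
+     """DCG of the given ranking divided by the DCG of the ideal ranking."""
+     ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
+     return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
+
+ # Relevance labels of retrieved passages, in the order the model ranked them.
+ ranked_relevances = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
+ print(round(ndcg_at_k(ranked_relevances), 4))  # 0.9197
+ ```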