sdadas committed
Commit 69c4876
1 Parent(s): 6a0c72f

Update README.md

Files changed (1)
  1. README.md +69 -0
README.md CHANGED
@@ -1,3 +1,72 @@
---
pipeline_tag: text-classification
tags:
- transformers
- information-retrieval
language: pl
license: apache-2.0
---

<h1 align="center">polish-reranker-large-mse</h1>

This is a Polish text ranking model trained using the mean squared error (MSE) distillation method on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents.
The training data included the following parts: 1) the Polish MS MARCO training split (800k queries); 2) the ELI5 dataset translated to Polish (over 500k queries); 3) a collection of Polish medical questions and answers (approximately 100k queries).
As the teacher model, we employed [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k), a large multilingual reranker based on the MT5-XXL architecture. As the student model, we chose [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-large-v2).
In the MSE method, the student is trained to directly replicate the outputs returned by the teacher.

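As a rough illustration of the MSE distillation objective described above, the framework-free sketch below computes the loss between student and teacher scores for one batch, together with its gradient with respect to the student scores. The function name `mse_distillation_loss` and the toy score values are assumptions made for illustration; the actual training code and hyperparameters are not part of this README.

```python
from typing import List, Tuple


def mse_distillation_loss(
    student: List[float], teacher: List[float]
) -> Tuple[float, List[float]]:
    """Return the MSE between student and teacher scores and its gradient
    with respect to the student scores."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must score the same pairs")
    n = len(student)
    # Loss: mean of squared differences; teacher scores act as fixed targets.
    loss = sum((s - t) ** 2 for s, t in zip(student, teacher)) / n
    # d(loss)/d(s_i) = 2 * (s_i - t_i) / n
    grad = [2.0 * (s - t) / n for s, t in zip(student, teacher)]
    return loss, grad


# Toy scores for three (query, document) pairs; the values are made up.
teacher_scores = [2.0, -1.0, 0.5]
student_scores = [1.0, -1.0, 1.5]

loss, grad = mse_distillation_loss(student_scores, teacher_scores)
print(loss, grad)
```

In a real training loop the same quantity would typically be computed by a deep learning framework, with the gradient propagated into the student's weights while the teacher stays frozen.
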
## Usage (Sentence-Transformers)

You can use the model with [sentence-transformers](https://www.SBERT.net) as follows:

```python
import torch
from sentence_transformers import CrossEncoder

query = "Jak dożyć 100 lat?"  # "How to live to be 100?"
answers = [
    # "You need to eat healthily and play sports."
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    # "You need to drink alcohol, party, and drive fast cars."
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    # "During the campaign, politicians promised to deal with the Sunday trading ban."
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Identity activation returns the raw model outputs as scores
model = CrossEncoder(
    "sdadas/polish-reranker-large-mse",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
# Score every (query, answer) pair
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```

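The scores are mainly useful for ordering candidates, so a common follow-up step is to sort the answers by score. A minimal sketch, assuming one float score per answer in the same order as the answers; the helper name `rank_answers` and the placeholder answers and scores below are made up for illustration and are not actual model outputs.

```python
from typing import List, Sequence, Tuple


def rank_answers(
    answers: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each answer with its score and sort from highest to lowest score."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    return sorted(zip(answers, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder answers and scores, not real model output.
example_answers = ["answer A", "answer B", "answer C"]
example_scores = [0.2, 1.7, -0.9]

ranked = rank_answers(example_answers, example_scores)
for answer, score in ranked:
    print(f"{score:.2f}\t{answer}")
```

With the snippet above, the call would be `rank_answers(answers, results.tolist())`.
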
## Usage (Hugging Face Transformers)

The model can also be used with Hugging Face Transformers in the following way:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

query = "Jak dożyć 100 lat?"  # "How to live to be 100?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-large-mse"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Join each query and answer with the separator tokens expected by the model
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    output = model(**tokens)
results = output.logits.numpy()
results = np.squeeze(results)  # one score per answer
print(results.tolist())
```

## Evaluation Results

The model achieves an **NDCG@10** of **60.27** in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
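
For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 ranked documents against the best possible ordering of the same documents. The sketch below uses the common linear-gain formulation with a log2 discount; whether PIRB uses exactly this variant is an assumption, and the relevance labels are made up.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions
    (linear gain, log2 position discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of documents in the order the reranker returned them
# (made-up values: 1 = relevant, 0 = not relevant).
ranked_relevances = [0, 1, 1, 0]
score = ndcg_at_k(ranked_relevances)
print(round(score, 4))
```

A perfect ordering yields 1.0 under this definition; benchmark tables often report the value multiplied by 100 and averaged over all queries.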