Create README.md
---
datasets:
- anli
- zen-E/ANLI-simcse-roberta-large-embeddings-pca-256
language:
- en
metrics:
- spearmanr
- pearsonr
library_name: transformers
---

The model is trained by knowledge distillation between "princeton-nlp/unsup-simcse-roberta-large" (the teacher) and "zen-E/bert-mini-sentence-distil-unsupervised" (the student) on the ANLI dataset.

The model can be used for inference through `AutoModel`.

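A minimal sketch of such usage follows; the placeholder model id and the choice of the pooler output as the sentence embedding are assumptions, inferred from the training forward function further down rather than stated in this card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "<this-model-id>"  # placeholder: substitute this repository's model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["A man is playing a guitar.", "Someone is playing an instrument."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# assumption: the pooler output is the sentence embedding,
# mirroring `_, o = self.bert(**inputs)` in the forward function below
embeddings = outputs.pooler_output

# cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```
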
The model achieves 0.836 Pearson and 0.840 Spearman correlation on the STS-B test set.

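These correlations are conventionally computed between the cosine similarity of the two sentence embeddings and the gold similarity scores; a sketch, where `encode` and `pairs` are hypothetical stand-ins for the inference code above and the STS-B test data:

```python
import torch.nn.functional as F
from scipy.stats import pearsonr, spearmanr

# `pairs` is a list of (sentence1, sentence2, gold_score) from the STS-B test set,
# and `encode` is a hypothetical helper returning embeddings as in the sketch above
emb1 = encode([s1 for s1, _, _ in pairs])
emb2 = encode([s2 for _, s2, _ in pairs])
gold = [score for _, _, score in pairs]

pred = F.cosine_similarity(emb1, emb2, dim=-1).tolist()
print("pearsonr:", pearsonr(pred, gold)[0])
print("spearmanr:", spearmanr(pred, gold)[0])
```
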
For more training detail, the training config and the PyTorch forward function are given below. The teacher's features are first transformed to 256 dimensions by the PCA object in "zen-E/ANLI-simcse-roberta-large-embeddings-pca-256", which can be loaded as follows:

```python
import joblib

# load the fitted PCA and project the teacher features down to 256 dimensions
pca = joblib.load('ANLI-simcse-roberta-large-embeddings-pca-256/pca_model.sav')
features_256 = pca.transform(features)
```

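For context, a sketch of how the teacher features fed to `pca.transform` might be produced; taking the teacher's pooler output as its sentence embedding is an assumption, not something stated in this card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

teacher_name = "princeton-nlp/unsup-simcse-roberta-large"
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()

sentences = ["A soccer game with multiple males playing."]
inputs = teacher_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # assumption: the pooler output is taken as the teacher sentence embedding
    features = teacher(**inputs).pooler_output.numpy()

# reduce the teacher features to 256 dimensions with the PCA loaded above
features_256 = pca.transform(features)
```
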
```python
config = {
    'epoch': 10,
    'learning_rate': 5e-5,
    'batch_size': 512,
    'temperature': 0.05
}
```

```python
def forward_cos_mse_kd(self, sentence1s, sentence2s, sentence3s,
                       teacher_sentence1_embs, teacher_sentence2_embs, teacher_sentence3_embs):
    """forward function for the ANLI dataset"""
    _, o1 = self.bert(**sentence1s)
    _, o2 = self.bert(**sentence2s)
    _, o3 = self.bert(**sentence3s)

    # compute student's cosine similarity between sentences
    cos_o1_o2 = cosine_sim(o1, o2)
    cos_o1_o3 = cosine_sim(o1, o3)

    # compute teacher's cosine similarity between sentences
    cos_o1_o2_t = cosine_sim(teacher_sentence1_embs, teacher_sentence2_embs)
    cos_o1_o3_t = cosine_sim(teacher_sentence1_embs, teacher_sentence3_embs)

    cos_sim = torch.cat((cos_o1_o2, cos_o1_o3), dim=-1)
    cos_sim_t = torch.cat((cos_o1_o2_t, cos_o1_o3_t), dim=-1)

    # KL divergence between student and teacher probabilities
    soft_teacher_probs = F.softmax(cos_sim_t / self.temperature, dim=1)
    kd_cos_loss = F.kl_div(F.log_softmax(cos_sim / self.temperature, dim=1),
                           soft_teacher_probs,
                           reduction='batchmean')

    # MSE loss between student and teacher embeddings
    o = torch.cat([o1, o2, o3], dim=0)
    teacher_embs = torch.cat([teacher_sentence1_embs, teacher_sentence2_embs, teacher_sentence3_embs], dim=0)
    kd_mse_loss = nn.MSELoss()(o, teacher_embs) / 3

    # equal weight for the two losses
    total_loss = kd_cos_loss * 0.5 + kd_mse_loss * 0.5
    return total_loss, kd_cos_loss, kd_mse_loss
```
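To tie the config and the forward function together, a sketch of a possible training step; the `cosine_sim` helper is not defined in this card, so the in-batch pairwise-similarity version below, the optimizer choice, and the batch field names are all assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_sim(a, b):
    # assumed helper: in-batch pairwise cosine-similarity matrix of shape (N, N)
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

# hypothetical training step; the AdamW optimizer and the batch field names
# are assumptions about the training pipeline, not taken from this card
optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])

for batch in dataloader:
    total_loss, kd_cos_loss, kd_mse_loss = model.forward_cos_mse_kd(
        batch['sentence1s'], batch['sentence2s'], batch['sentence3s'],
        batch['teacher_sentence1_embs'], batch['teacher_sentence2_embs'],
        batch['teacher_sentence3_embs'],
    )
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```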