DeepPavlovAI committed
Commit c5ae87f · Parent(s): a146645
Update README.md
README.md CHANGED
@@ -9,9 +9,9 @@ Our DistilRuBERT-tiny was highly inspired by \[3\], \[4\]. Namely, we used
 * KL loss (between teacher and student output logits)
 * MLM loss (between tokens labels and student output logits)
 * Cosine embedding loss (between mean of six consecutive hidden states from teacher's encoder and one hidden state of the student)
-* MSE loss (between six consecutive attention maps from teacher's encoder and one attention map of the student)
+* MSE loss (between mean of six consecutive attention maps from teacher's encoder and one attention map of the student)
 
-The model was trained for about
+The model was trained for about 80 hrs. on 8 nVIDIA Tesla P100-SXM2.0 16Gb.
 
 To evaluate improvements in the inference speed, we ran teacher and student models on random sequences with seq_len=512, batch_size = 16 (for throughput) and batch_size=1 (for latency).
 All tests were performed on Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz and nVIDIA Tesla P100-SXM2.0 16Gb.
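
For context on the four loss terms listed in the hunk above, here is a minimal sketch, assuming a PyTorch setup with matching teacher/student hidden sizes and attention head counts, of how the KL, MLM, cosine-embedding, and attention-map MSE terms could be combined. The function name `distillation_losses`, the equal loss weights, the softmax temperature, and the six-to-one layer grouping parameter are illustrative assumptions, not DeepPavlov's actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_losses(
    student_logits,   # (batch, seq, vocab) - student MLM head output
    teacher_logits,   # (batch, seq, vocab) - teacher MLM head output
    mlm_labels,       # (batch, seq) token labels, -100 on unmasked positions
    student_hidden,   # list of (batch, seq, dim) student encoder hidden states
    teacher_hidden,   # list of (batch, seq, dim) teacher encoder hidden states
    student_attn,     # list of (batch, heads, seq, seq) student attention maps
    teacher_attn,     # list of (batch, heads, seq, seq) teacher attention maps
    group=6,          # six consecutive teacher layers per student layer (from the card)
    temperature=2.0,  # assumed temperature for the KL term
):
    # 1) KL loss between teacher and student output logits.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) MLM loss between token labels and student output logits.
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss: mean of each group of six consecutive
    #    teacher hidden states vs. one student hidden state.
    cos = 0.0
    for i, s_h in enumerate(student_hidden):
        t_h = torch.stack(teacher_hidden[i * group:(i + 1) * group]).mean(dim=0)
        target = torch.ones(s_h.size(0) * s_h.size(1), device=s_h.device)
        cos = cos + F.cosine_embedding_loss(
            s_h.reshape(-1, s_h.size(-1)), t_h.reshape(-1, t_h.size(-1)), target
        )

    # 4) MSE loss: mean of each group of six consecutive teacher attention
    #    maps vs. one student attention map.
    mse = 0.0
    for i, s_a in enumerate(student_attn):
        t_a = torch.stack(teacher_attn[i * group:(i + 1) * group]).mean(dim=0)
        mse = mse + F.mse_loss(s_a, t_a)

    # Equal weighting is an assumption; the real run may weight the terms differently.
    return kl + mlm + cos + mse
```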
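
Similarly, a minimal sketch of the throughput/latency measurement described in the last two context lines (random sequences, seq_len=512, batch_size=16 for throughput, batch_size=1 for latency). The helper name `benchmark`, the vocabulary size, and the warm-up/iteration counts are assumptions, not the exact harness behind the reported numbers.

```python
import time
import torch

def benchmark(model, vocab_size=30000, seq_len=512, batch_size=16,
              n_warmup=5, n_iters=20, device="cpu"):
    """Average seconds per forward pass on random token sequences."""
    model = model.to(device).eval()
    # Random token ids stand in for real text, as described in the card.
    ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    with torch.no_grad():
        for _ in range(n_warmup):      # warm-up passes, not timed
            model(ids)
        if device == "cuda":
            torch.cuda.synchronize()   # make sure queued GPU work is done
        start = time.perf_counter()
        for _ in range(n_iters):
            model(ids)
        if device == "cuda":
            torch.cuda.synchronize()   # wait for GPU work before stopping the clock
    return (time.perf_counter() - start) / n_iters
```

Usage would follow the two settings from the card: `benchmark(student, batch_size=16)` for the throughput setting and `benchmark(student, batch_size=1)` for the latency setting, run once per device of interest.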