DeepPavlovAI committed on
Commit
c5ae87f
1 Parent(s): a146645

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -9,9 +9,9 @@ Our DistilRuBERT-tiny was highly inspired by [3], [4]. Namely, we used
  * KL loss (between teacher and student output logits)
  * MLM loss (between token labels and student output logits)
  * Cosine embedding loss (between mean of six consecutive hidden states from teacher's encoder and one hidden state of the student)
- * MSE loss (between six consecutive attention maps from teacher's encoder and one attention map of the student)
+ * MSE loss (between mean of six consecutive attention maps from teacher's encoder and one attention map of the student)

- The model was trained for about 30 hrs. on 8 nVIDIA Tesla P100-SXM2.0 16Gb.
+ The model was trained for about 80 hrs. on 8 nVIDIA Tesla P100-SXM2.0 16Gb.

  To evaluate improvements in the inference speed, we ran teacher and student models on random sequences with seq_len=512, batch_size=16 (for throughput) and batch_size=1 (for latency).
  All tests were performed on Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz and nVIDIA Tesla P100-SXM2.0 16Gb.
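
For readers comparing the two sides of the hunk: the four losses are summed into a single distillation objective. Below is a minimal PyTorch sketch, assuming a 6-to-1 teacher-to-student layer mapping and per-layer lists of hidden states and attention maps; the helper name `distillation_loss`, the temperature, and the equal loss weighting are illustrative, not DeepPavlov's actual training code.

```python
# Minimal sketch of combining the four distillation losses from the hunk
# above -- NOT DeepPavlov's training code. Assumptions: the teacher has six
# times as many encoder layers as the student, so six consecutive teacher
# layers map onto one student layer; t_hidden/s_hidden and t_attn/s_attn
# are per-layer lists of hidden states and attention maps.
import torch
import torch.nn.functional as F

def distillation_loss(t_logits, s_logits, labels,
                      t_hidden, s_hidden, t_attn, s_attn, T=2.0):
    # KL loss between teacher and student output logits (temperature-softened)
    kl = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T ** 2

    # MLM loss between token labels and student output logits
    mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)

    # Cosine embedding loss between the mean of six consecutive teacher
    # hidden states and the corresponding student hidden state
    cos = s_logits.new_zeros(())
    for i, s_h in enumerate(s_hidden):
        t_h = torch.stack(t_hidden[6 * i:6 * i + 6]).mean(dim=0)
        target = s_h.new_ones(s_h.size(0) * s_h.size(1))
        cos = cos + F.cosine_embedding_loss(
            s_h.reshape(-1, s_h.size(-1)), t_h.reshape(-1, t_h.size(-1)), target)

    # MSE loss between the mean of six consecutive teacher attention maps
    # and one student attention map (the wording this commit corrects)
    mse = s_logits.new_zeros(())
    for i, s_a in enumerate(s_attn):
        t_a = torch.stack(t_attn[6 * i:6 * i + 6]).mean(dim=0)
        mse = mse + F.mse_loss(s_a, t_a)

    return kl + mlm + cos + mse
```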
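The speed-test protocol in the closing context lines can be reproduced approximately as follows. This is a sketch, not the original benchmark script: the checkpoint name and the `benchmark` helper are assumptions, and it runs on CPU (move the model and inputs to CUDA for the GPU numbers).

```python
# Hedged sketch of the speed test described above: random token sequences
# with seq_len=512, batch_size=16 for throughput and batch_size=1 for
# latency. The checkpoint name is illustrative; the reported numbers came
# from a Xeon E5-2698 v4 CPU and a Tesla P100-SXM2.0 16Gb GPU.
import time
import torch
from transformers import AutoModel

def benchmark(model, batch_size, seq_len=512, n_iters=100, vocab_size=30000):
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    model.eval()
    with torch.no_grad():
        for _ in range(10):                     # warm-up, excluded from timing
            model(input_ids)
        start = time.perf_counter()
        for _ in range(n_iters):
            model(input_ids)
        elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed       # sequences per second

model = AutoModel.from_pretrained("DeepPavlov/distilrubert-tiny-cased-conversational")
print(f"throughput (batch=16): {benchmark(model, 16):.1f} seq/s")
print(f"latency    (batch=1):  {1000 / benchmark(model, 1):.1f} ms/seq")
```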