DeepPavlovAI committed
Commit c5ae87f · Parent(s): a146645
Update README.md
README.md CHANGED
@@ -9,9 +9,9 @@ Our DistilRuBERT-tiny was highly inspired by \[3\], \[4\]. Namely, we used
 * KL loss (between teacher and student output logits)
 * MLM loss (between tokens labels and student output logits)
 * Cosine embedding loss (between mean of six consecutive hidden states from teacher's encoder and one hidden state of the student)
-* MSE loss (between six consecutive attention maps from teacher's encoder and one attention map of the student)
+* MSE loss (between mean of six consecutive attention maps from teacher's encoder and one attention map of the student)
 
-The model was trained for about
+The model was trained for about 80 hrs. on 8 nVIDIA Tesla P100-SXM2.0 16Gb.
 
 To evaluate improvements in the inference speed, we ran teacher and student models on random sequences with seq_len=512, batch_size = 16 (for throughput) and batch_size=1 (for latency).
 All tests were performed on Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz and nVIDIA Tesla P100-SXM2.0 16Gb.
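
For context on the four loss terms listed in the hunk above, here is a minimal sketch, assuming a PyTorch setup with matching teacher/student hidden sizes and attention head counts, of how the KL, MLM, cosine-embedding, and attention-map MSE terms could be combined. The function name `distillation_losses`, the equal loss weights, the softmax temperature, and the six-to-one layer grouping parameter are illustrative assumptions, not DeepPavlov's actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_losses(
    student_logits,   # (batch, seq, vocab) - student MLM head output
    teacher_logits,   # (batch, seq, vocab) - teacher MLM head output
    mlm_labels,       # (batch, seq) token labels, -100 on unmasked positions
    student_hidden,   # list of (batch, seq, dim) student encoder hidden states
    teacher_hidden,   # list of (batch, seq, dim) teacher encoder hidden states
    student_attn,     # list of (batch, heads, seq, seq) student attention maps
    teacher_attn,     # list of (batch, heads, seq, seq) teacher attention maps
    group=6,          # six consecutive teacher layers per student layer (from the card)
    temperature=2.0,  # assumed temperature for the KL term
):
    # 1) KL loss between teacher and student output logits.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) MLM loss between token labels and student output logits.
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss: mean of each group of six consecutive
    #    teacher hidden states vs. one student hidden state.
    cos = 0.0
    for i, s_h in enumerate(student_hidden):
        t_h = torch.stack(teacher_hidden[i * group:(i + 1) * group]).mean(dim=0)
        target = torch.ones(s_h.size(0) * s_h.size(1), device=s_h.device)
        cos = cos + F.cosine_embedding_loss(
            s_h.reshape(-1, s_h.size(-1)), t_h.reshape(-1, t_h.size(-1)), target
        )

    # 4) MSE loss: mean of each group of six consecutive teacher attention
    #    maps vs. one student attention map.
    mse = 0.0
    for i, s_a in enumerate(student_attn):
        t_a = torch.stack(teacher_attn[i * group:(i + 1) * group]).mean(dim=0)
        mse = mse + F.mse_loss(s_a, t_a)

    # Equal weighting is an assumption; the real run may weight the terms differently.
    return kl + mlm + cos + mse
```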
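
Similarly, a minimal sketch of the throughput/latency measurement described in the last two context lines (random sequences, seq_len=512, batch_size=16 for throughput, batch_size=1 for latency). The helper name `benchmark`, the vocabulary size, and the warm-up/iteration counts are assumptions, not the exact harness behind the reported numbers.

```python
import time
import torch

def benchmark(model, vocab_size=30000, seq_len=512, batch_size=16,
              n_warmup=5, n_iters=20, device="cpu"):
    """Average seconds per forward pass on random token sequences."""
    model = model.to(device).eval()
    # Random token ids stand in for real text, as described in the card.
    ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    with torch.no_grad():
        for _ in range(n_warmup):      # warm-up passes, not timed
            model(ids)
        if device == "cuda":
            torch.cuda.synchronize()   # make sure queued GPU work is done
        start = time.perf_counter()
        for _ in range(n_iters):
            model(ids)
        if device == "cuda":
            torch.cuda.synchronize()   # wait for GPU work before stopping the clock
    return (time.perf_counter() - start) / n_iters
```

Usage would follow the two settings from the card: `benchmark(student, batch_size=16)` for the throughput setting and `benchmark(student, batch_size=1)` for the latency setting, run once per device of interest.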