metadata
license: apache-2.0
language:
  - ru
  - en
library_name: transformers
pipeline_tag: feature-extraction

DeBERTa-base

Pretrained bidirectional encoder for the Russian language. The model was trained with the standard masked language modeling (MLM) objective on large text corpora, including open social data. See the Training Details section for more information.

⚠️ This model contains only the encoder part without any pretrained head.

  • Developed by: deepvk
  • Model type: DeBERTa
  • Languages: Mostly Russian, with a small fraction of other languages
  • License: Apache 2.0

How to Get Started with the Model

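import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare encoder (no task-specific head).
tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-base")
model = AutoModel.from_pretrained("deepvk/deberta-v1-base")

text = "Привет, мир!"  # "Hello, world!"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Token-level features, shape (batch, sequence_length, 768).
embeddings = outputs.last_hidden_state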

Training Details

Training Data

400 GB of filtered and deduplicated texts in total. A mix of the following data: Wikipedia, Books, Twitter comments, Pikabu, Proza.ru, Film subtitles, News websites, and Social corpus.

Deduplication procedure

  1. Compute shingles of size 5 for every sample (text).
  2. Compute MinHash with 100 seeds → every sample gets a signature of 100 hash values.
  3. Split every signature into 10 buckets → each bucket, which contains (100 / 10) = 10 numbers, is hashed into a single hash → every sample ends up with 10 hashes.
  4. For each bucket, find duplicates: take the samples that share the same hash → compute their pairwise Jaccard similarity → if the similarity is > 0.7, then the pair is a duplicate.
  5. Gather the duplicates from all the buckets and filter them out.
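
The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline that was actually run: the word-level shingles, the seeded BLAKE2b hash, and the keep-the-first-copy policy in the hypothetical `deduplicate` helper are all choices made here for the example. Only the constants (shingle size 5, 100 seeds, 10 buckets, threshold 0.7) come from the description.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

SHINGLE_SIZE = 5   # from the card
NUM_SEEDS = 100    # from the card
NUM_BUCKETS = 10   # from the card
THRESHOLD = 0.7    # from the card


def shingles(text, size=SHINGLE_SIZE):
    # Assumption: word-level shingles (the card does not say word vs. character).
    words = text.split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def seeded_hash(data, seed):
    # Assumption: salted 64-bit BLAKE2b stands in for the unspecified hash family.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set, num_seeds=NUM_SEEDS):
    # Step 2: one minimum per seed gives a signature of `num_seeds` values.
    encoded = [s.encode("utf-8") for s in shingle_set]
    return [min(seeded_hash(e, seed) for e in encoded) for seed in range(num_seeds)]


def bucket_hashes(signature, num_buckets=NUM_BUCKETS):
    # Step 3: split the signature into equal slices and hash each slice once.
    rows = len(signature) // num_buckets
    return [hash(tuple(signature[b * rows:(b + 1) * rows])) for b in range(num_buckets)]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def find_duplicates(texts):
    # Steps 4 and 5: collect verified duplicate pairs (i, j) with i < j.
    sets = [shingles(t) for t in texts]
    buckets = [bucket_hashes(minhash(s)) for s in sets]
    duplicates = set()
    for b in range(NUM_BUCKETS):
        groups = defaultdict(list)
        for idx, hashes in enumerate(buckets):
            groups[hashes[b]].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if jaccard(sets[i], sets[j]) > THRESHOLD:
                    duplicates.add((i, j))
    return duplicates


def deduplicate(texts):
    # Assumption: keep the earliest sample of every duplicate pair.
    dropped = {j for _, j in find_duplicates(texts)}
    return [t for idx, t in enumerate(texts) if idx not in dropped]
```

Only samples that collide in at least one bucket are compared, so the exact Jaccard check runs on a small set of candidate pairs rather than on every pair in the corpus.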

Training Hyperparameters

| Argument | Value |
|---|---|
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |
| Adam betas | 0.9, 0.98 |
| Adam eps | 1e-6 |
| Weight decay | 1e-2 |
| Batch size | 2240 |
| Num training steps | 1M |
| Num warm-up steps | 10k |
| LR scheduler | Linear |
| LR | 2e-5 |
| Gradient norm | 1.0 |

The model was trained on a machine with 8×A100 GPUs for approximately 30 days.
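
To make the learning-rate settings concrete, here is a small sketch of a linear schedule with warm-up built from the values above (peak LR 2e-5, 10k warm-up steps, 1M training steps). The exact shape is an assumption: the card only says "Linear", so the sketch assumes the rate ramps up from zero during warm-up and then decays linearly to zero at the last step. The function name `linear_lr` is hypothetical.

```python
PEAK_LR = 2e-5            # "LR" from the card
WARMUP_STEPS = 10_000     # "Num warm-up steps" from the card
TOTAL_STEPS = 1_000_000   # "Num training steps" from the card


def linear_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at `step`: linear warm-up, then linear decay."""
    if step < warmup:
        # Assumption: warm-up starts at zero and reaches the peak at `warmup`.
        return peak_lr * step / warmup
    # Assumption: decay reaches zero exactly at the final training step.
    return peak_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 505_000, 1_000_000):
        print(step, linear_lr(step))
```

Under these assumptions the rate peaks at step 10,000 and is back to half the peak at step 505,000, the midpoint of the decay phase.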

Architecture details

| Argument | Value |
|---|---|
| Encoder layers | 12 |
| Encoder attention heads | 12 |
| Encoder embed dim | 768 |
| Encoder ffn embed dim | 3,072 |
| Activation function | GELU |
| Attention dropout | 0.1 |
| Dropout | 0.1 |
| Max positions | 512 |
| Vocab size | 50266 |
| Tokenizer type | Byte-level BPE |
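
A few quantities follow directly from the architecture values above. The short sketch below derives them; the dictionary keys are made up for the example, and the figures cover only the per-head width, the feed-forward expansion factor, and the token embedding matrix, not the full parameter count of the model.

```python
# Values copied from the architecture table; key names are illustrative only.
ARCH = {
    "layers": 12,
    "heads": 12,
    "embed_dim": 768,
    "ffn_dim": 3072,
    "max_positions": 512,
    "vocab_size": 50266,
}

# Width of each attention head: 768 / 12 = 64.
head_dim = ARCH["embed_dim"] // ARCH["heads"]

# Feed-forward expansion factor: 3072 / 768 = 4.
ffn_ratio = ARCH["ffn_dim"] / ARCH["embed_dim"]

# Entries in the token embedding matrix: 50266 * 768 = 38,604,288.
token_embedding_params = ARCH["vocab_size"] * ARCH["embed_dim"]

if __name__ == "__main__":
    print(head_dim, ffn_ratio, token_embedding_params)
```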

Evaluation

We evaluated the model on the Russian SuperGLUE dev set. The best result in each task is marked in bold. All models have the same size except the distilled version of DeBERTa.

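| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|---|---|---|---|---|---|---|---|---|
| vk-deberta-distill | 0.433 | 0.56 | 0.625 | 0.59 | 0.943 | 0.569 | 0.726 | 0.635 |
| vk-roberta-base | 0.46 | 0.56 | 0.679 | **0.769** | 0.960 | 0.569 | 0.658 | 0.665 |
| vk-deberta-base | 0.450 | **0.61** | **0.722** | 0.704 | 0.948 | 0.578 | **0.76** | **0.682** |
| vk-bert-base | 0.467 | 0.57 | 0.587 | 0.704 | 0.953 | **0.583** | 0.737 | 0.657 |
| sber-bert-base | **0.491** | **0.61** | 0.663 | **0.769** | **0.962** | 0.574 | 0.678 | 0.678 |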
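
The card does not say how the Score column is computed. The numbers are consistent with an unweighted mean over the seven tasks, which the sketch below checks. Treat this as an observation from the table rather than a documented definition; the names `RESULTS`, `mean_score`, and `max_gap` are made up for the example.

```python
# Per-model task scores in table order
# (RCB, PARus, MuSeRC, TERRa, RUSSE, RWSD, DaNetQA) and the reported Score.
RESULTS = {
    "vk-deberta-distill": ([0.433, 0.56, 0.625, 0.59, 0.943, 0.569, 0.726], 0.635),
    "vk-roberta-base": ([0.46, 0.56, 0.679, 0.769, 0.960, 0.569, 0.658], 0.665),
    "vk-deberta-base": ([0.450, 0.61, 0.722, 0.704, 0.948, 0.578, 0.76], 0.682),
    "vk-bert-base": ([0.467, 0.57, 0.587, 0.704, 0.953, 0.583, 0.737], 0.657),
    "sber-bert-base": ([0.491, 0.61, 0.663, 0.769, 0.962, 0.574, 0.678], 0.678),
}


def mean_score(task_scores):
    # Assumption: Score is the plain average of the seven task metrics.
    return sum(task_scores) / len(task_scores)


def max_gap(results=RESULTS):
    # Largest difference between the recomputed mean and the reported Score.
    return max(abs(mean_score(tasks) - score) for tasks, score in results.values())


if __name__ == "__main__":
    for name, (tasks, reported) in RESULTS.items():
        print(f"{name}: mean={mean_score(tasks):.3f} reported={reported:.3f}")
```

For every row the recomputed mean agrees with the reported Score to three decimal places.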