File size: 1,693 Bytes
1818d35 49736c4 9cc7a78 49736c4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
---
language: cs
license: MIT
---
# Small-E-Czech
Small-E-Czech is an [Electra](https://arxiv.org/abs/2003.10555)-small model pretrained on a Czech corpus created at Seznam.cz. Like other pretrained models, it should be finetuned on a downstream task of interest before use.
### How to use the discriminator in transformers
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch
discriminator = ElectraForPreTraining.from_pretrained("seznam/small-e-czech")
tokenizer = ElectraTokenizerFast.from_pretrained(
"seznam/small-e-czech", strip_accents=False
)
sentence = "Za hory, za doly, mé zlaté parohy"
fake_sentence = "Za hory, za doly, kočka zlaté parohy"
fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
outputs = discriminator(fake_inputs)
predictions = torch.nn.Sigmoid()(outputs[0]).cpu().detach().numpy()
for token in fake_sentence_tokens:
print("{:>7s}".format(token), end="")
print()
for prediction in predictions.squeeze():
print("{:7.1f}".format(prediction), end="")
print()
```
In the output we can see the probabilities of particular tokens not belonging in the sentence (i.e. having been faked by the generator) according to the discriminator:
```
[CLS] za hory , za dol ##y , kočka zlaté paro ##hy [SEP]
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8 0.3 0.2 0.1 0.0
```
### Finetuning
For instructions on how to finetune the model on a new task, see the official HuggingFace transformers [tutorial](https://huggingface.co/transformers/training.html). |