---
language: pl
tags:
- T5
- translation
- summarization
- question answering
- reading comprehension
datasets:
- ccnet
- nkjp
- wikipedia
- open subtitles
- free readings
license: cc-by-4.0
---

# plT5 Large
**plT5** models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising objective, in which the model reconstructs corrupted spans of the input text.
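A minimal sketch of that objective: the snippet below masks a span with a T5 sentinel token and asks the model to fill it in (it assumes the checkpoint exposes the standard `<extra_id_0>` sentinel; the Polish sentence is made up for the example).

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("allegro/plt5-large")

# T5-style denoising: the sentinel marks a corrupted span to reconstruct.
# "Warszawa jest <extra_id_0> Polski." ~ "Warsaw is the <mask> of Poland."
inputs = tokenizer("Warszawa jest <extra_id_0> Polski.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```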

## Corpus
plT5 was trained on six corpora available for the Polish language:

| Corpus | Tokens | Documents |
| :------ | ------: | ------: |
| [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M  | 7.9M |
| [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M  | 7.0M |
| [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=1)| 1357M  | 3.9M |
| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M  | 1.1M |
| [Wikipedia](https://dumps.wikimedia.org/) | 260M  | 1.4M |
| [Wolne Lektury](https://wolnelektury.pl/) | 41M  | 5.5k |

## Tokenizer
The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens.
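
A minimal sketch of the resulting segmentation (the example sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-large")

print(tokenizer.vocab_size)                         # ~50k, as described above
print(tokenizer.tokenize("Litwo! Ojczyzno moja!"))  # unigram subword pieces
```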

## Usage
Example code:
```python
from transformers import AutoTokenizer, AutoModel

# AutoModel loads the bare encoder-decoder; see the note below for generation.
tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-large")
model = AutoModel.from_pretrained("allegro/plt5-large")
```
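
Note that `AutoModel` loads the encoder-decoder stack without a language-modeling head, so a forward pass returns hidden states rather than text; for generation, load the checkpoint with `AutoModelForSeq2SeqLM` instead. A sketch of a bare forward pass, continuing from the snippet above (the sentence is arbitrary):

```python
import torch

inputs = tokenizer("Ala ma kota.", return_tensors="pt")
with torch.no_grad():
    # The bare T5 stack also needs decoder inputs; we reuse the encoder input here.
    outputs = model(**inputs, decoder_input_ids=inputs["input_ids"])
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```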

## License
CC BY 4.0

## Citation
If you use this model, please cite the following paper:
```
@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}
```

## Authors
The model was trained by [**Machine Learning Research Team at Allegro**](https://ml.allegro.tech/) and [**Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).

You can contact us at: <a href="mailto:[email protected]">[email protected]</a>