---
language: dv
---
# byt5-dv
Pretrained from scratch on Dhivehi (the language of the Maldives)
with ByT5, Google's new byte-level model, which reads raw UTF-8 bytes instead of a learned subword vocabulary.
Corpus: dv.wikipedia.org as of March 2020 (TFDS)
Notebook - Pretraining on Wikipedia: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
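For reference, here is a minimal sketch of what byte-level inputs look like, using the `transformers` library with the public `google/byt5-small` checkpoint (a published checkpoint of this model would load the same way):

```python
from transformers import ByT5Tokenizer

# ByT5 has no learned vocabulary: input IDs are the UTF-8 bytes of the text,
# shifted by 3 to leave room for the special tokens <pad>=0, </s>=1, <unk>=2.
tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

ids = tokenizer("ދިވެހި").input_ids  # "Dhivehi" written in Thaana script
print(ids)
# Thaana code points fall in U+0780-U+07BF, so each character is 2 UTF-8
# bytes: 6 characters -> 12 byte IDs plus a trailing </s> (id 1).
```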
## Demo
Notebook - Finetuning on Maldivian news classification task: https://colab.research.google.com/drive/11u5SafR4bKICmArgDl6KQ9vqfYtDpyWp
Current performance on this task:
- mBERT: 52%
- byt5-dv (first run): 78%
- dv-wave (ELECTRA): 89%
- dv-muril: 90.7%
- dv-labse: 91.3-91.5%
Source of dataset: https://github.com/Sofwath/DhivehiDatasets
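The notebook above has the full training loop; the sketch below only illustrates the text-to-text pattern that T5-family models (including ByT5) use for classification, where the label is generated as a string. The function, batch variables, and hyperparameters here are illustrative assumptions, not the notebook's actual settings:

```python
import torch
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(texts, labels):
    """One gradient step: `texts` are news articles, `labels` are category names."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    # Classification is framed as generation: the target sequence
    # is simply the label string.
    tgt = tokenizer(labels, padding=True, return_tensors="pt").input_ids
    tgt[tgt == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=tgt).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```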
## Work in progress - todos
The Wikipedia corpus is too small for this language. Remaining todos:

- In the future I would like to add OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept them as one TFDS dataset (a possible approach is sketched after this list).
- This is based on ByT5-small; we should try a larger model.
- This needs more time for pretraining.
- This needs better finetuning (reformatting batches so all of the training data is used).
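One possible shape for that corpus merge, sketched with the Hugging Face `datasets` library rather than TFDS; the dataset names and configs here (the Wikipedia dump date, OSCAR's `unshuffled_deduplicated_dv` split) are assumptions about the public hubs, and Sofwath's corpus would still need its own loader:

```python
from datasets import load_dataset, concatenate_datasets

# Assumed configs: a Dhivehi Wikipedia dump and OSCAR's deduplicated
# Dhivehi split. Both are reduced to a shared `text` column so that
# their schemas line up before concatenation.
wiki = load_dataset("wikipedia", "20200301.dv", split="train")
oscar = load_dataset("oscar", "unshuffled_deduplicated_dv", split="train")

wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
oscar = oscar.remove_columns([c for c in oscar.column_names if c != "text"])

corpus = concatenate_datasets([wiki, oscar]).shuffle(seed=42)
```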