---
language: dv
---

# byt5-dv

Pretrained from scratch on Dhivehi (the language of the Maldives)
with ByT5, Google's byte-level, tokenizer-free pretraining strategy.
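
A minimal sketch of loading the checkpoint and seeing the byte-level behavior (the Hub id `monsoon-nlp/byt5-dv` is an assumption here; substitute this model's actual repo name):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Hub id is assumed; replace with this model's actual repo name.
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/byt5-dv")
model = T5ForConditionalGeneration.from_pretrained("monsoon-nlp/byt5-dv")

# ByT5 operates on raw UTF-8 bytes, so Thaana script needs no custom vocabulary:
ids = tokenizer("ދިވެހި", return_tensors="pt").input_ids
print(ids.shape)  # one id per UTF-8 byte, plus the end-of-sequence token
```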

Corpus: dv.wikipedia.org as of March 2020 (TFDS)

Notebook - Pretraining on Wikipedia: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH

## Demo

Notebook - Finetuning on Maldivian news classification task: https://colab.research.google.com/drive/11u5SafR4bKICmArgDl6KQ9vqfYtDpyWp
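
For reference, a minimal sketch of how ByT5 finetuning for classification can be framed as text-to-text, with the label string as the generation target. The article text and label name are placeholders, not taken from the notebook, and `google/byt5-small` is used only because it is a known public checkpoint; in the demo you would load this model instead:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

article = "..."   # a Dhivehi news article (placeholder)
label = "sports"  # category name used as the target string (placeholder)

inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")
labels = tokenizer(label, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy against the label text.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```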

Current performance on the news classification task:

- mBERT: 52%
- byt5-dv (first run): 78%
- dv-wave (ELECTRA): 89%
- dv-muril: 90.7%
- dv-labse: 91.3-91.5%

Source of dataset: https://github.com/Sofwath/DhivehiDatasets

## Work in progress - todos

- The Wikipedia corpus is too small for this language. In the future I would add
  OSCAR and Sofwath's Maldivian corpus, if I can rewrite the pretraining script to
  accept them as one TFDS dataset (see the sketch after this list).
- This is based on ByT5-small; we should try a larger model.
- This needs more time for pretraining.
- This needs better finetuning (reformatting batches so all of the training data is used).
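
A rough sketch of that corpus combination, assuming the extra corpora are available as plain-text files; the file names and the exact TFDS config name are assumptions, not taken from the existing script:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Dhivehi Wikipedia via TFDS; "20200301.dv" matches the March 2020 snapshot
# mentioned above, but confirm the config name in your TFDS version.
wiki = tfds.load("wikipedia/20200301.dv", split="train")
wiki_text = wiki.map(lambda ex: ex["text"])  # keep article text only

# Hypothetical local dumps of OSCAR (dv) and Sofwath's DhivehiDatasets corpus.
extra = tf.data.TextLineDataset(["oscar_dv.txt", "dhivehi_datasets.txt"])

# Both pipelines yield scalar tf.string elements, so they can be spliced
# into a single dataset for pretraining.
combined = wiki_text.concatenate(extra)
```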