monsoon-nlp committed
Commit ce955e5
1 Parent(s): 682a3ba

add finetuning demo, caveats

Files changed (1)
  1. README.md +17 -2
README.md CHANGED
@@ -9,15 +9,30 @@ with ByT5, Google's new byte-level tokenizer strategy.
 
 Corpus: dv.wikipedia.org as of March 2020 (TFDS)
 
-Notebook: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
+Notebook - Pretraining on Wikipedia: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
 
 ## Demo
 
+Notebook - Finetuning on Maldivian news classification task: https://colab.research.google.com/drive/11u5SafR4bKICmArgDl6KQ9vqfYtDpyWp
 
+Current performance:
 
+- mBERT: 52%
+- byt5-dv (first run): 78%
+- dv-wave (ELECTRA): 89%
+- dv-muril: 90.7%
+- dv-labse: 91.3-91.5%
+
+Source of dataset: https://github.com/Sofwath/DhivehiDatasets
+
-## Todos
+## Work in progress - todos
 
 The Wikipedia corpus is too small for this language. In the future I would add
 OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept those
 as one TFDS dataset.
 
+This is based on ByT5-small ... we should try a larger model
+
+This needs more time for pretraining
+
+This needs better finetuning (reformatting batches to get all training data)
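
For context on the pretrained checkpoint this diff describes: ByT5 models use the standard T5 seq2seq architecture with a byte-level tokenizer, so loading one in 🤗 Transformers is a short sketch like the one below. The model id `monsoon-nlp/byt5-dv` is assumed from this repo's name; adjust if the checkpoint lives elsewhere.

```python
# Minimal sketch: load the byte-level checkpoint for inference.
# "monsoon-nlp/byt5-dv" is an assumed model id, not confirmed by the diff.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/byt5-dv")
model = T5ForConditionalGeneration.from_pretrained("monsoon-nlp/byt5-dv")

# ByT5 operates on raw UTF-8 bytes, so Thaana script needs no special vocab.
inputs = tokenizer("ދިވެހި", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```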
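The finetuning notebook for the news classification task is linked in the diff above. A common recipe for classification with a T5-family model is to cast it as text-to-text and generate the label string; the sketch below shows that general recipe, not necessarily the exact setup in the notebook, and the headlines and label names are placeholders.

```python
# Sketch: text-to-text classification finetuning for a ByT5 checkpoint.
# Texts and labels are placeholders, not real DhivehiDatasets rows.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_ID = "monsoon-nlp/byt5-dv"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

texts = ["<headline 1>", "<headline 2>"]  # placeholder Dhivehi headlines
labels = ["category-a", "category-b"]     # placeholder class names

enc = tokenizer(texts, padding=True, truncation=True,
                max_length=512, return_tensors="pt")
target = tokenizer(labels, padding=True, return_tensors="pt").input_ids
target[target == tokenizer.pad_token_id] = -100  # mask padding in the loss

model.train()
loss = model(input_ids=enc.input_ids,
             attention_mask=enc.attention_mask,
             labels=target).loss
loss.backward()
optimizer.step()
print(float(loss))
```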
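On the first todo (treating Wikipedia, OSCAR, and Sofwath's corpus as one TFDS dataset): one plausible shape for that script is to load the Wikipedia split through TFDS and concatenate it with text-line datasets built from local dumps of the other corpora. The TFDS config name and the file paths below are assumptions for illustration.

```python
# Sketch: merge several Dhivehi corpora into one tf.data pipeline.
# The TFDS config and local file paths are assumptions, not tested names.
import tensorflow as tf
import tensorflow_datasets as tfds

# Wikipedia snapshot via TFDS (config name guessed from "March 2020").
wiki = tfds.load("wikipedia/20200301.dv", split="train")
wiki_text = wiki.map(lambda ex: ex["text"])

# Hypothetical local dumps of the other corpora, one document per line.
oscar = tf.data.TextLineDataset("oscar_dv.txt")
sofwath = tf.data.TextLineDataset("dhivehi_datasets.txt")

# One combined text stream that a ByT5 pretraining script could consume.
combined = wiki_text.concatenate(oscar).concatenate(sofwath).shuffle(10_000)
for line in combined.take(1):
    print(line.numpy()[:80])
```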