---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---
Work in progress, December 2021.

A collection of Dutch T5 models.

- Many thanks to the Google TPU Research Cloud for providing access to a TPU cluster!
- Continuation of the work started during the Hugging Face community week, organized by HuggingFace, with TPU usage sponsored by Google, for the project *Pre-train T5 from scratch in Dutch*.
- Trained with an improved training script: no more exceptions during training, so no restarts were required.
- All models trained with TensorFlow metrics.
- Thanks to @gsarti for creating the t5-flax-gcp repository!
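The checkpoints can be loaded with the Hugging Face `transformers` library. Below is a minimal sketch, assuming the models are published under the same `yhavinga` Hub namespace as the dataset; the exact model id is an assumption, and the checkpoints are pre-trained only, so they still need fine-tuning for downstream tasks.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed Hub id: one of the checkpoints from the table below,
# taken to live under the same namespace as the dataset.
model_name = "yhavinga/t5-base-dutch"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The models are pre-trained with the T5 span-corruption objective only,
# so this snippet merely checks that tokenizer and weights load and generate.
inputs = tokenizer("Het weer is vandaag <extra_id_0>.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```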
|                        | t5-base-dutch | t5-v1.1-base-dutch | t5-v1.1-large-dutch-cased | t5-v1.1-base-dutch-uncased |
|------------------------|---------------|--------------------|---------------------------|----------------------------|
| tokenizer              | cased         | uncased            | cased                     | uncased                    |
| source model config    | google/t5-base | google/t5-v1_1-base | google/t5-v1_1-large     | google/t5-v1_1-base        |
| dataset                | yhavinga/mc4_nl_cleaned | yhavinga/mc4_nl_cleaned | yhavinga/mc4_nl_cleaned | yhavinga/mc4_nl_cleaned |
| tpu vm                 | two           | one                | three                     | one                        |
| finished               | YES           |                    |                           |                            |
| **Hyperparameters**    |               |                    |                           |                            |
| epochs                 | 1             | 1                  | 4                         | 2                          |
| per-device batch size  | 16            | 16                 | 2                         | 8                          |
| tot. batch size        | 128           | 128                | 16                        | ?                          |
| steps                  | 508 976       | 508 976            | 8 428 012                 | ?                          |
| max seq. length        | 512           | 512                | 1024                      | 1024                       |
| tot. tok. trained on   | 33B           | 33B                | 138B                      | ?                          |
| optimizer              | adafactor     | adafactor          | adafactor                 | adafactor                  |
| warmup steps           | 10000         | 10000              | 10000                     | 10000                      |
| learning rate          | 0.005         | 0.005              | 0.005                     | 0.005                      |
| weight decay           | 0.01          | 0.01               | 0.01                      | 0.001                      |
| tie embeds             | false         | false              | false                     | false                      |
| validation split size  | 15K examples  | 15K examples       | 15K examples              | 15K examples               |
| **Model config**       |               |                    |                           |                            |
| d_ff                   | 3072          | 2048               | 2816                      | 2048                       |
| d_kv                   | 64            | 64                 | 64                        | 64                         |
| d_model                | 768           | 768                | 1024                      | 768                        |
| dropout rate           | 0.1           | 0.1                | 0.1 (0.0 during pre-training) | 0.1 (0.0 during pre-training) |
| ff projection          | relu          | gated-gelu         | gated-gelu                | gated-relu                 |
| num decoder layers     | 12            | 12                 | 24                        | 12                         |
| num heads              | 12            | 12                 | 16                        | 12                         |
| num layers             | 12            | 12                 | 24                        | 12                         |
| rel. attn. buckets     | 32            | 32                 | 32                        | 32                         |
| vocab size             | 32103         | 32103              | 32103                     | 32103                      |
| Training time          | ~ 100 hours   | 101 hours          | ~ 370 hours               | ?                          |
| **Evaluation**         |               |                    |                           |                            |
| accuracy               | 0.6976        |                    |                           |                            |
| loss                   | 1.379         |                    |                           |                            |
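For reference, a rough sketch of how the Adafactor settings in the table (10 000 warmup steps, peak learning rate 0.005, weight decay 0.01) could be expressed with optax. This is illustration only, not the actual training script (see the t5-flax-gcp repository for that), and the schedule shape after warmup is an assumption.

```python
import optax

warmup_steps = 10_000
total_steps = 508_976  # steps for the t5-base-dutch column
peak_lr = 0.005

# Linear warmup to the peak learning rate, then constant.
# Assumption: the real run may use a different decay after warmup.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                              transition_steps=warmup_steps),
        optax.constant_schedule(peak_lr),
    ],
    boundaries=[warmup_steps],
)

# Adafactor with the weight-decay rate from the table's first column.
optimizer = optax.adafactor(learning_rate=schedule, weight_decay_rate=0.01)
```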