yhavinga/t5-base-dutch

@@ -17,42 +17,5 @@ inference: false
 * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
 * Continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
 * Using improved training script - no more exceptions during training, so no restarting required.
-* All models trained with tensorflow metrics.
-* Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!
-|                       |`t5-base-dutch`          |`t5-v1.1-base-dutch`     |`t5-v1.1-large-dutch-cased`| `t5-v1.1-base-dutch-uncased`|
-|-----------------------|-------------------------|-------------------------|---------------------------|-----------------------------|
-|`tokenizer`            |`cased`                  |`uncased`                |`cased`                    |`uncased`                    |
-|`source model config`  |`google/t5-base`         |`google/t5-v1_1-base`    |`google/t5-v1_1-large`     |`google/t5-v1_1_base`        |
-|`dataset`              |`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`  |`yhavinga/mc4_nl_cleaned`    |
-|`tpu vm`               | two                     | one                     | three                     | one                         |
-|`finished`             |                         | YES                     |                           |                             |
-|*Hyperparameters*      |                         |                         |                           |                             |
-|`epochs`               | 1                       | 1                       | 4                         | 2                           |
-|`per-device batch size`| 16                      | 16                      | 2                         | 8                           |
-|`tot. batch size`      | 128                     | 128                     | 16                        | ?                           |
-|`steps`                | 508 976                 | 508 976                 | 8 428 012                 | ?                           |
-|`max seq. length`      | 512                     | 512                     | 1024                      | 1024                        |
-|`tot. tok. trained on` | 33B                     | 33B                     | 138B                      | ?                           |
-|`optimizer`            | adafactor               | adafactor               | adafactor                 | adafactor                   |
-|`warmup steps`         | 10000                   | 10000                   | 10000                     | 10000                       |
-|`learning rate`        | 0.005                   | 0.005                   | 0.005                     | 0.005                       |
-|`weigth decay`         | 0.01                    | 0.01                    | 0.01                      | 0.001                       |
-|`tie embeds`           |`false`                  |`false`                  |`false`                    |`false`                      |
-|`validation split size`| 15K examples            | 15K examples            | 15K examples              | 15K examples                |
-|*Model config*         |                         |                         |                           |                             |
-|`d_ff`                 | 3072                    | 2048                    | 2816                      | 2048                        |
-|`d_kv`                 | 64                      | 64                      | 64                        | 64                          |
-|`d_model`              | 768                     | 768                     | 1024                      | 768                         |
-|`dropout rate`         | 0.1                     | 0.1                     | 0.1 (0.0 wh. pre-train.)  | 0.1 (0.0 wh. pre-train.)    |
-|`ff projection`        |`relu`                   |`gated-gelu`             |`gated-gelu`               |`gated-relu`                 |
-|`num decoder layers`   | 12                      | 12                      | 24                        | 12                          |
-|`num heads`            | 12                      | 12                      | 16                        | 12                          |
-|`num layers`           | 12                      | 12                      | 24                        | 12                          |
-|`rel. attn. buckets`   | 32                      | 32                      | 32                        | 32                          |
-|`vocab size`           | 32103                   | 32103                   | 32103                     | 32103                       |
-|*Training time*        | ~ 100 hours             | 101 hours               | ~ 370 hours               | ?                           |
-|*Evaluation*           |                         |                         |                           |                             |
-|`accuracy`             |                         | 0.6976                  |                           |                             |
-|`loss`                 |                         | 1.379                   |                           |                             |

+* All models trained with tensorboard metrics.
+* Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!