	---
	language:
	- nl
	datasets:
	- yhavinga/mc4_nl_cleaned
	tags:
	- t5
	- seq2seq

	inference: false
	license: apache-2.0
	---

	# t5-base-dutch


	Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
	& [Dat Nguyen](https://www.linkedin.com/in/dat-nguyen-49a641138/) during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
	See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model,
	and the demo application [Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer),
	that are based on this model.

	5 jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.

	11 jan 2022: See also [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) with eval acc 0.78


	This T5 model has 222M parameters.
	It was pre-trained on the dataset
	`mc4_nl_cleaned` config `full` for 1 epoch and a duration of 2d9h,
	with a sequence length of 512, batch size 128 and 527500 total steps.
	Pre-training evaluation loss and accuracy are 1.38 and 0.70.
	After fine-tuning on 25K samples of Dutch CNN summarization, the Rouge1 score is 33.0
	(note: this evaluation model was not saved).
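As a rough sanity check on the pre-training figures above, the following sketch derives the token budget and throughput. It assumes every input sequence is filled to the full 512 tokens, so the token counts are upper bounds and not measured values.

```python
# Figures quoted in the paragraph above.
SEQ_LEN = 512
BATCH_SIZE = 128
TOTAL_STEPS = 527_500
DURATION_HOURS = 2 * 24 + 9  # "2d9h"

# Upper bound: assumes each of the 128 sequences holds the full 512 tokens.
tokens_per_step = SEQ_LEN * BATCH_SIZE
total_input_tokens = tokens_per_step * TOTAL_STEPS
steps_per_second = TOTAL_STEPS / (DURATION_HOURS * 3600)

print(f"tokens per step:    {tokens_per_step:,}")
print(f"total input tokens: {total_input_tokens:,}")
print(f"steps per second:   {steps_per_second:.2f}")
```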

	* Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks, so the inference widget on the right has been turned off.
	* For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for
	the [Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer) example application!

	Please refer to the original T5 paper and the Scale Efficiently paper for more information about the T5 architecture
	and configs. Note that this model (t5-base-dutch) is unrelated to these projects and is not an 'official' checkpoint.
	* [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
	* [Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686) by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler.


	![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)


	## Tokenizer

	The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
	and has 32003 tokens.
	It was trained on Dutch mC4 with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
	See [tokenizer.json](./raw/main/tokenizer.json) for details.
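To give an idea of what the normalizers do to input text, here is a minimal standard-library approximation of the `NFKC` and multi-space steps. It is a sketch only: the `Nmt` step is omitted, the function name `approx_normalize` is hypothetical, and the regular expression for multiple spaces is an assumption. The authoritative definition is the one in `tokenizer.json`.

```python
import re
import unicodedata


def approx_normalize(text: str) -> str:
    """Approximate the NFKC + multi-space normalizers (the Nmt step is not modeled)."""
    # NFKC folds compatibility characters, e.g. ligatures and non-breaking spaces.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of two or more spaces into a single space (assumed pattern).
    return re.sub(r" {2,}", " ", text)


print(approx_normalize("ﬁjn   weer vandaag"))
```

NFKC runs first so that characters it maps to a plain space are also covered by the space-collapsing step.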

	## Dataset

	All models listed below are trained on
	[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
	which is the original mC4 with the following changes:

	* Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
	* Sentences with fewer than 3 words are removed
	* Sentences with a word of more than 1000 characters are removed
	* Documents with fewer than 5 sentences are removed
	* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
	"use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
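The rules above can be sketched as a single filter function. This is an illustration under stated assumptions, not the script that produced the dataset: the sentence splitter, the case-insensitive matching, the order of the checks, and the placeholder `BAD_WORDS` set are all hypothetical.

```python
import re

# Hypothetical placeholder; the real filter uses a selection of the LDNOOBW lists.
BAD_WORDS = {"badword"}

BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)


def clean_document(text):
    """Return the cleaned document, or None if the document is removed."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    if set(re.findall(r"\w+", lowered)) & BAD_WORDS:
        return None

    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 3:
            continue  # fewer than 3 words
        if any(len(word) > 1000 for word in words):
            continue  # contains a word of more than 1000 characters
        kept.append(sentence)

    # Assumption: the 5-sentence minimum is checked after sentence filtering.
    if len(kept) < 5:
        return None
    return " ".join(kept)
```

For example, a document of six ordinary sentences followed by a two-word sentence keeps the six and drops the last one, while any document mentioning "privacy policy" is removed entirely.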

	The Dutch-English models are trained on a 50/50 mix of Dutch mC4 and English C4.

	## Models

	Three types of models have been trained. `t5-base-dutch` is the only model with an original T5 config.
	The other model types t5-v1.1 and t5-eff have `gated-gelu` instead of `relu` as activation function,
	and were trained with a dropout of `0.0` unless training would diverge (`t5-v1.1-large-dutch-cased`).
	The t5-eff models differ mostly in their number of layers. The table below lists
	the dimensions of these models. Note that `efficient` is a misnomer for models with few layers,
	e.g. `t5-xl-4L-dutch-english-cased`, which is not efficient and is one of the worst models on downstream summarization.

	\| \| t5-base-dutch \| t5-v1.1-base-dutch-uncased \| t5-v1.1-base-dutch-cased \| t5-v1.1-large-dutch-cased \| t5-v1_1-base-dutch-english-cased \| t5-v1_1-base-dutch-english-cased-1024 \| t5-small-24L-dutch-english \| t5-xl-4L-dutch-english-cased \| t5-base-36L-dutch-english-cased \| t5-eff-xl-8l-dutch-english-cased \| t5-eff-large-8l-dutch-english-cased \|
	\|:------------------\|:----------------\|:-----------------------------\|:---------------------------\|:----------------------------\|:-----------------------------------\|:----------------------------------------\|:-----------------------------\|:-------------------------------\|:----------------------------------\|:-----------------------------------\|:--------------------------------------\|
	\| type \| t5 \| t5-v1.1 \| t5-v1.1 \| t5-v1.1 \| t5-v1.1 \| t5-v1.1 \| t5 eff \| t5 eff \| t5 eff \| t5 eff \| t5 eff \|
	\| d_model \| 768 \| 768 \| 768 \| 1024 \| 768 \| 768 \| 512 \| 2048 \| 768 \| 1024 \| 1024 \|
	\| d_ff \| 3072 \| 2048 \| 2048 \| 2816 \| 2048 \| 2048 \| 1920 \| 5120 \| 2560 \| 16384 \| 4096 \|
	\| num_heads \| 12 \| 12 \| 12 \| 16 \| 12 \| 12 \| 8 \| 32 \| 12 \| 32 \| 16 \|
	\| d_kv \| 64 \| 64 \| 64 \| 64 \| 64 \| 64 \| 64 \| 64 \| 64 \| 128 \| 64 \|
	\| num_layers \| 12 \| 12 \| 12 \| 24 \| 12 \| 12 \| 24 \| 4 \| 36 \| 8 \| 8 \|
	\| num parameters \| 223M \| 248M \| 248M \| 783M \| 248M \| 248M \| 250M \| 585M \| 729M \| 1241M \| 335M \|
	\| feed_forward_proj \| relu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \| gated-gelu \|
	\| dropout \| 0.1 \| 0.0 \| 0.0 \| 0.1 \| 0.0 \| 0.0 \| 0.0 \| 0.1 \| 0.0 \| 0.0 \| 0.0 \|
	\| dataset \| mc4_nl_cleaned \| mc4_nl_cleaned full \| mc4_nl_cleaned full \| mc4_nl_cleaned \| mc4_nl_cleaned small_en_nl \| mc4_nl_cleaned large_en_nl \| mc4_nl_cleaned large_en_nl \| mc4_nl_cleaned large_en_nl \| mc4_nl_cleaned large_en_nl \| mc4_nl_cleaned large_en_nl \| mc4_nl_cleaned large_en_nl \|
	\| tr. seq len \| 512 \| 1024 \| 1024 \| 512 \| 512 \| 1024 \| 512 \| 512 \| 512 \| 512 \| 512 \|
	\| batch size \| 128 \| 64 \| 64 \| 64 \| 128 \| 64 \| 128 \| 512 \| 512 \| 64 \| 128 \|
	\| total steps \| 527500 \| 1014525 \| 1210154 \| 2427498 \| 2839630 \| 1520k/3397024 \| 851852 \| 212963 \| 212963 \| 538k/1703705 \| 851850 \|
	\| epochs \| 1 \| 2 \| 2 \| 2 \| 10 \| 4 \| 1 \| 1 \| 1 \| 1 \| 1 \|
	\| duration \| 2d9h \| 5d5h \| 6d6h \| 8d13h \| 11d18h \| 9d1h \| 4d10h \| 6d1h \| 17d15h \| 4d19h \| 3d23h \|
	\| optimizer \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \| adafactor \|
	\| lr \| 0.005 \| 0.005 \| 0.005 \| 0.005 \| 0.005 \| 0.005 \| 0.005 \| 0.005 \| 0.009 \| 0.005 \| 0.005 \|
	\| warmup \| 10000.0 \| 10000.0 \| 10000.0 \| 10000.0 \| 10000.0 \| 5000.0 \| 20000.0 \| 2500.0 \| 1000.0 \| 1500.0 \| 1500.0 \|
	\| eval loss \| 1.38 \| 1.20 \| 0.96 \| 1.07 \| 1.11 \| 1.13 \| 1.18 \| 1.27 \| 1.05 \| 1.3019 \| 1.15 \|
	\| eval acc \| 0.70 \| 0.73 \| 0.78 \| 0.76 \| 0.75 \| 0.74 \| 0.74 \| 0.72 \| 0.76 \| 0.71 \| 0.74 \|
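The parameter counts follow from the dimensions in the table. The sketch below assumes the usual T5 layout: no bias terms, one weight vector per layer norm, and one relative-attention-bias table with 32 buckets per stack. The vocabulary sizes and the embedding-tying flags are assumptions: 32003 (tied) is the tokenizer size stated for `t5-base-dutch`, and 32103 (untied) is used for the two Dutch-English models because it reproduces the `num_parameters` values listed for them in the translation table further down.

```python
def t5_param_count(d_model, d_ff, num_heads, d_kv, num_layers,
                   vocab_size, gated, tied_embeddings, num_buckets=32):
    """Estimate the number of parameters of a T5 encoder-decoder model."""
    inner_dim = num_heads * d_kv
    attention = 4 * d_model * inner_dim                  # q, k, v and output projections
    feed_forward = (3 if gated else 2) * d_model * d_ff  # gated variants have an extra input projection
    encoder_layer = attention + feed_forward + 2 * d_model      # plus 2 layer norms
    decoder_layer = 2 * attention + feed_forward + 3 * d_model  # self + cross attention, 3 layer norms
    layers = num_layers * (encoder_layer + decoder_layer)
    relative_bias = 2 * num_buckets * num_heads          # one table per stack
    final_norms = 2 * d_model
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return layers + relative_bias + final_norms + embeddings


base_dutch = t5_param_count(768, 3072, 12, 64, 12, 32003, gated=False, tied_embeddings=True)
small_24l = t5_param_count(512, 1920, 8, 64, 24, 32103, gated=True, tied_embeddings=False)
base_36l = t5_param_count(768, 2560, 12, 64, 36, 32103, gated=True, tied_embeddings=False)

print(f"t5-base-dutch:            {base_dutch / 1e6:.0f}M")
print(f"t5-small-24L-dutch-english: {small_24l:,}")
print(f"t5-base-36L-dutch-english:  {base_36l:,}")
```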

	## Evaluation on summarization

	The models below have been evaluated on the summarization downstream task on 50K samples from the CNN Dailymail dataset.
	All models were fine-tuned with the AdamW optimizer with a batch size of 128 and constant learning rate of 1e-3 after a
	warmup of 64 steps, with a label smoothing factor of 0.05.
	Article and summary token lengths were set to 1024 and 142.
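The fine-tuning schedule and label smoothing can be illustrated in plain Python. This is a sketch under assumptions: the warmup is taken to be linear from zero (the text only says "a warmup of 64 steps"), and the smoothing spreads the factor uniformly over all classes, which is one common convention. The function names are hypothetical.

```python
def learning_rate(step, peak=1e-3, warmup_steps=64):
    """Linear warmup from 0 to `peak` (assumed shape), then constant."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak


def smoothed_targets(label, num_classes, factor=0.05):
    """One-hot target mixed with a uniform distribution over all classes."""
    uniform = factor / num_classes
    targets = [uniform] * num_classes
    targets[label] += 1.0 - factor
    return targets


for step in (0, 32, 64, 1000):
    print(step, learning_rate(step))
print(smoothed_targets(label=2, num_classes=4))
```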

	\| \| t5-base-dutch \| t5-v1.1-base-dutch-uncased \| t5-v1.1-base-dutch-cased \| t5-v1_1-base-dutch-english-cased \| t5-v1_1-base-dutch-english-cased-1024 \| t5-small-24L-dutch-english \| t5-xl-4L-dutch-english-cased \| t5-base-36L-dutch-english-cased \| t5-eff-large-8l-dutch-english-cased \| mt5-base \|
	\|:-------------------\|:----------------\|:-----------------------------\|:---------------------------\|:-----------------------------------\|:----------------------------------------\|:-----------------------------\|:-------------------------------\|:----------------------------------\|:--------------------------------------\|:-----------\|
	\| rouge1 \| 33.0313 \| 33.8432 \| 34.0906 \| 33.1116 \| 34.6465 \| 34.376 \| 30.8983 \| 35.0931 \| 33.9293 \| 33.6466 \|
	\| rouge2 \| 12.9452 \| 13.7706 \| 13.6203 \| 13.275 \| 13.8525 \| 13.8939 \| 11.6005 \| 14.3823 \| 13.6274 \| 13.1085 \|
	\| rougeL \| 23.7204 \| 24.5642 \| 24.7304 \| 24.3561 \| 24.721 \| 25.2496 \| 22.6536 \| 25.3213 \| 24.5595 \| 23.909 \|
	\| rougeLsum \| 29.842 \| 30.7783 \| 31.1438 \| 30.0548 \| 31.6104 \| 31.3838 \| 27.8467 \| 32.3526 \| 30.952 \| 30.5054 \|
	\| gen_len \| 90.488 \| 91.832 \| 92.122 \| 89.583 \| 98.333 \| 90.442 \| 92.342 \| 96.832 \| 95.057 \| 96.312 \|
	\| num parameters \| 223M \| 248M \| 248M \| 248M \| 248M \| 250M \| 585M \| 729M \| 335M \| 582M \|
	\| samples_per_second \| 3.195 \| 3.039 \| 3.0 \| 3.216 \| 2.974 \| 1.594 \| 2.47 \| 0.623 \| 3.087 \| 1.201 \|

	## Translation models

	The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset.
	The models named `*-multi` support both directions of translation. The models are trained on CCMatrix only. As this is
	a really large dataset with over 100M Dutch-English sentence pairs, the models are trained on only a fraction of it;
	the table below lists the number of steps and the training duration. Evaluation is performed on a CCMatrix section not trained on, and also
	on Tatoeba and Opus Books. The `_bp` columns list the brevity penalty. The `avg_bleu` score is the BLEU score
	averaged over all three evaluation datasets.

	The translation metrics are listed in the table below:

	\| \| t5-base-36L-ccmatrix-en-nl \| t5-base-36L-ccmatrix-multi \| t5-base-36L-ccmatrix-multi \| t5-small-24L-ccmatrix-multi \| t5-small-24L-ccmatrix-multi \|
	\|:-----------------------\|:-----------------------------\|:-----------------------------\|:-----------------------------\|:------------------------------\|:------------------------------\|
	\| id \| 0 \| 14 \| 15 \| 16 \| 20 \|
	\| source_lang \| en \| en \| nl \| en \| nl \|
	\| target_lang \| nl \| nl \| en \| nl \| en \|
	\| source_prefix \| translate English to Dutch: \| translate English to Dutch: \| translate Dutch to English: \| translate English to Dutch: \| translate Dutch to English: \|
	\| tatoeba_bp \| 0.9897614370103832 \| 0.9736173618072754 \| 0.943521164106552 \| 0.9760983304454847 \| 0.9406676405486575 \|
	\| ccmatrix_bp \| 0.9590750786190209 \| 0.9536276245543676 \| 0.9635673583308255 \| 0.9517934939463099 \| 0.9585648049711814 \|
	\| opus_books_bp \| 0.7478011343203491 \| 0.7950194726093107 \| 0.9362852511299413 \| 0.770498474692027 \| 0.8870675076932444 \|
	\| tatoeba_score \| 50.63006965176505 \| 46.580601850286214 \| 52.82030981131822 \| 46.419809813946046 \| 51.67887417355214 \|
	\| ccmatrix_score \| 60.33227938980884 \| 56.81297258845844 \| 62.836646082246254 \| 57.404319674892406 \| 63.08633155239932 \|
	\| opus_books_score \| 10.405013868050663 \| 13.477997378535864 \| 24.93113308798125 \| 12.927244801365507 \| 23.418552148252047 \|
	\| avg_bleu \| 40.455787636541515 \| 38.95719060576017 \| 46.86269632718191 \| 38.91712476340132 \| 46.0612526247345 \|
	\| total steps \| 78125 \| 390625 \| 390625 \| 390625 \| 390625 \|
	\| duration \| 14h \| 101h \| 101h \| 74h \| 74h \|
	\| num_parameters \| 728928000 \| 728928000 \| 728928000 \| 249991680 \| 249991680 \|
	\| label_smoothing_factor \| 0.09 \| 0.15 \| 0.15 \| 0.1 \| 0.1 \|
	\| learning_rate \| 0.0001 \| 5e-05 \| 5e-05 \| 0.0005 \| 0.0005 \|
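The `avg_bleu` row can be reproduced from the three score rows, and the `_bp` rows follow the standard BLEU brevity penalty. The sketch below uses the scores from the `t5-base-36L-ccmatrix-en-nl` column; the helper names are hypothetical, and the brevity penalty is the textbook definition and not necessarily the exact code used for the evaluation.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty: penalizes output shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def average_bleu(scores):
    """Unweighted mean over the evaluation datasets."""
    return sum(scores) / len(scores)


# tatoeba_score, ccmatrix_score, opus_books_score for t5-base-36L-ccmatrix-en-nl
en_nl_scores = [50.63006965176505, 60.33227938980884, 10.405013868050663]
print(average_bleu(en_nl_scores))
print(brevity_penalty(candidate_len=90, reference_len=100))
```

A brevity penalty below 1, such as the `opus_books_bp` values, indicates that the generated translations were shorter than the references on that dataset.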

	## Acknowledgements

	This project would not have been possible without compute generously provided by Google through the
	[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
	instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many
	models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would
	have completed this project otherwise.
	The following repositories were helpful in setting up the TPU-VM,
	and in getting an idea of what sensible hyper-parameters are for training gpt2 from scratch.

	* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
	* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)

	Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)