hmByT5 - Preliminary Language Models

Preliminary Historic Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

  • English (British Library Corpus - Books)
  • German (Europeana Newspaper)
  • French (Europeana Newspaper)
  • Finnish (Europeana Newspaper)
  • Swedish (Europeana Newspaper)
  • Dutch (Delpher Corpus)

More details can be found in our GitHub repository.
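
The checkpoints can be loaded with the Hugging Face Transformers library. The snippet below is only a minimal sketch; the model id is a placeholder and should be replaced with the actual hmByT5 checkpoint name from this organization:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder model id - replace with the actual hmByT5 checkpoint name.
model_name = "hmbyt5-preliminary/byt5-small-historic-multilingual"

# ByT5 operates directly on UTF-8 bytes, so the tokenizer needs no vocabulary file.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Encode a short text and run it through the encoder to obtain byte-level representations.
inputs = tokenizer("Eine Maschine kann niemals ein Mensch werden.", return_tensors="pt")
encoder_outputs = model.encoder(**inputs)
print(encoder_outputs.last_hidden_state.shape)  # (batch, sequence length in bytes, hidden size)
```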

Pretraining

We pretrain hmByT5 on a v3-32 TPU Pod. Details about the training can be found here.

Evaluation on Downstream Tasks (NER)

We evaluated the hmByT5 model (pretrained for 200k steps) on the English AjMC corpus:

| Hyper-param Configuration              | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.         |
|----------------------------------------|-------|-------|-------|-------|-------|--------------|
| wsFalse-bs4-e10-lr0.00016-poolingfirst | 83.80 | 84.78 | 83.74 | 83.35 | 84.37 | 84.01 ± 0.50 |
| wsFalse-bs4-e10-lr0.00015-poolingfirst | 84.67 | 82.69 | 83.92 | 84.53 | 82.90 | 83.74 ± 0.82 |
| wsFalse-bs8-e10-lr0.00016-poolingfirst | 82.12 | 83.82 | 83.37 | 83.00 | 83.70 | 83.20 ± 0.61 |
| wsFalse-bs8-e10-lr0.00015-poolingfirst | 83.45 | 82.83 | 84.15 | 81.76 | 83.78 | 83.19 ± 0.84 |
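
The configuration string encodes the fine-tuning hyper-parameters: batch size (bs), number of epochs (e), learning rate (lr) and the subtoken pooling strategy (pooling). As a rough illustration of such a setup, here is a minimal NER fine-tuning sketch assuming a Flair-based pipeline; the model id, data folder and column format are assumptions and not taken from this repository:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Assumed CoNLL-style layout of the AjMC data: token in column 0, NER tag in column 1.
corpus = ColumnCorpus("data/ajmc-en", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Placeholder model id - replace with the actual hmByT5 checkpoint name.
embeddings = TransformerWordEmbeddings(
    "hmbyt5-preliminary/byt5-small-english",
    subtoken_pooling="first",  # "poolingfirst" in the configuration string
    fine_tune=True,
)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ajmc-en-hmbyt5",
    learning_rate=0.00016,  # "lr0.00016"
    mini_batch_size=4,      # "bs4"
    max_epochs=10,          # "e10"
)
```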

It turns out that the results are not on par with the current SOTA on the English AjMC corpus (see a comparison here). We therefore continue our experiments with the Hugging Face Transformers JAX/FLAX implementation to pretrain ByT5 models on TPU.

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️
