hmByT5 - Preliminary Language Models
Preliminary historic multilingual and monolingual ByT5 models. The following languages are currently covered:
- English (British Library Corpus - Books)
- German (Europeana Newspaper)
- French (Europeana Newspaper)
- Finnish (Europeana Newspaper)
- Swedish (Europeana Newspaper)
- Dutch (Delpher Corpus)
More details can be found in our GitHub repository.
In this experiment we sample 4B bytes (~4GB of text) from each corpus (upsampling Swedish and Finnish) and train for another epoch (2 epochs in total).
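The byte-budget sampling described above can be illustrated with a minimal sketch. It is not the script used for these models: the function name `sample_byte_budget`, the line-level shuffling, and the "repeat the corpus until the budget is met" upsampling strategy are all assumptions made for illustration.

```python
import random


def sample_byte_budget(lines, budget_bytes, seed=0):
    """Collect lines until their total UTF-8 size reaches budget_bytes.

    Hypothetical helper: if the corpus is smaller than the budget, it is
    reshuffled and traversed again, which upsamples it by repetition.
    """
    if budget_bytes <= 0:
        return []
    # Drop empty lines so every sampled line adds at least one byte.
    pool = [line for line in lines if line]
    if not pool:
        raise ValueError("corpus contains no non-empty lines")

    rng = random.Random(seed)
    sampled, total = [], 0
    while total < budget_bytes:
        order = pool[:]
        rng.shuffle(order)  # a fresh random pass over the corpus
        for line in order:
            sampled.append(line)
            total += len(line.encode("utf-8"))  # count bytes, not characters
            if total >= budget_bytes:
                break
    return sampled
```

With a budget of `4_000_000_000`, a corpus larger than the budget is subsampled in a single pass, while a smaller one (as Swedish and Finnish appear to be here) is repeated until the budget is reached. Counting encoded bytes rather than characters matters because non-ASCII letters such as `å` or `ö` take two bytes each in UTF-8.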
Pretraining
We use the official JAX/FLAX example in Hugging Face Transformers to pretrain a ByT5 model on a single v3-8 TPU. Details about the training can be found here.
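For readers unfamiliar with ByT5, the sketch below shows how text becomes model input: ByT5 operates directly on UTF-8 bytes instead of a learned subword vocabulary. The code is a plain-Python illustration that assumes the default ByT5 id layout (ids 0, 1 and 2 reserved for padding, end-of-sequence and unknown, with each byte value shifted by 3). The constant and function names are hypothetical, and this is not the tokenizer code used in the pretraining run.

```python
# Assumed default ByT5 id layout: three reserved ids, then the 256 byte values.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
BYTE_OFFSET = 3


def encode(text):
    """Map text to byte-level ids and append the end-of-sequence id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids):
    """Map ids back to text, skipping reserved ids and out-of-range values."""
    data = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return data.decode("utf-8", errors="ignore")
```

Because every byte has an id, characters that are common in historic text but rare in modern corpora, such as the long s (`ſ`), never fall out of vocabulary; they simply occupy more than one position in the input sequence.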
Evaluation on Downstream Tasks (NER)
We evaluated the hmByT5 model on downstream NER tasks:
Model | English AjMC | German AjMC | French AjMC | Finnish NewsEye | Swedish NewsEye | Dutch ICDAR | French ICDAR | Avg. |
---|---|---|---|---|---|---|---|---|
hmbyt5-preliminary/byt5-small-multilingual-4g-2e | 83.86 ± 0.61 | 87.54 ± 0.19 | 84.29 ± 0.41 |
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️