|
--- |
|
language: |
|
- af |
|
- am |
|
- ar |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- ca |
|
- ceb |
|
- co |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fil |
|
- fr |
|
- fy |
|
- ga |
|
- gd |
|
- gl |
|
- gu |
|
- ha |
|
- haw |
|
- he |
|
- hi |
|
- hmn |
|
- ht |
|
- hu |
|
- hy |
|
- id |
|
- ig |
|
- is |
|
- it |
|
- iw |
|
- ja |
|
- jv |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lb |
|
- lo |
|
- lt |
|
- lv |
|
- mg |
|
- mi |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- mt |
|
- my |
|
- ne |
|
- nl |
|
- 'no' |
|
- ny |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sd |
|
- si |
|
- sk |
|
- sl |
|
- sm |
|
- sn |
|
- so |
|
- sq |
|
- sr |
|
- st |
|
- su |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- tg |
|
- th |
|
- tr |
|
- uk |
|
- und |
|
- ur |
|
- uz |
|
- vi |
|
- xh |
|
- yi |
|
- yo |
|
- zh |
|
- zu |
|
license: mit |
|
datasets: |
|
- mc4 |
|
--- |
|
|
|
# MyT5 |
|
|
|
|
|
|
|
## Model Details |
|
|
|
MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
|
The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf). |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer |
|
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency |
|
- **Model type:** T5 |
|
- **Language(s) (NLP):** Multilingual |
|
- **License:** MIT |
|
|
|
### Model Sizes |
|
|
|
- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters |
|
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters |
|
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **[Repository](https://github.com/tomlimi/MYTE)** |
|
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)** |
|
|
|
## How to Get Started with the Model |
|
|
|
The snippet below shows the basic usage of the model for multilingual language modeling. |
|
The custom tokenizer is available in the [MYTE GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`.
|
We also plan to release it on HuggingFace in the future. |
|
|
|
```python |
|
from transformers import T5ForConditionalGeneration |
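# MyT5Tokenizer ships with the MYTE repository (src/myt5/myt5_tokenizer.py);
# run this snippet from the repository root (or add it to PYTHONPATH) so the
# import below resolves.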
|
from src.myt5.myt5_tokenizer import MyT5Tokenizer |
|
import torch |
|
|
|
MODEL_SIZE = "large" # small, base, or large |
|
|
|
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)
|
tokenizer = MyT5Tokenizer() |
|
|
|
pre_texts = ['"We now have', |
|
'„Mamy teraz myszy w wieku', |
|
'"""எங்களிடம் இப்போது'] |
|
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.', |
|
'4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.', |
|
'4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."'] |
|
|
|
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt") |
|
targets = tokenizer(post_texts, padding="longest", return_tensors="pt") |
|
|
|
|
|
outputs = model(**inputs, labels=targets.input_ids) |
|
probs = torch.nn.functional.softmax(outputs.logits, dim=-1) |
|
``` |
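
The same tokenizer can also decode generated ids back to text. Below is a minimal generation sketch continuing the snippet above; it assumes the custom tokenizer exposes `batch_decode`, as standard Transformers tokenizers do. Note that outputs of the pretraining-only checkpoints reflect the span-corruption objective rather than fluent continuations.

```python
# Minimal generation sketch, continuing the snippet above.
# Assumes MyT5Tokenizer implements batch_decode (standard for Transformers tokenizers).
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```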
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was trained with the standard T5 objective of restoring corrupted spans, using the multilingual mC4 dataset.
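
As a toy illustration of span corruption (shown at the word level for readability; the actual objective operates on MYTE byte sequences with sentinel ids):

```python
# Toy word-level illustration of T5 span corruption; the real objective corrupts
# MYTE byte sequences sampled from mC4.
original     = "The quick brown fox jumps over the lazy dog"
model_input  = "The <extra_id_0> fox jumps over the <extra_id_1> dog"
model_target = "<extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>"
```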
|
|
|
### Preprocessing |
|
|
|
Instead of raw UTF-8 bytes, we used the morphologically-driven byte (MYTE) representation.
|
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details. |
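
A rough way to see the effect of the representation is to compare the raw UTF-8 byte length of a sentence with the length of its MYTE-encoded sequence, reusing the tokenizer from the usage snippet above and assuming it follows the standard Transformers call interface; the exact compression varies by language and script.

```python
# Compare raw UTF-8 byte count with the MYTE-encoded sequence length.
# Reuses `tokenizer` from the usage snippet; the gap is typically largest
# for non-Latin scripts.
text = "எங்களிடம் இப்போது"  # Tamil fragment from the example prompts
print(len(text.encode("utf-8")))       # number of raw UTF-8 bytes
print(len(tokenizer(text).input_ids))  # number of MYTE token ids
```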
|
|
|
|
|
### Training Hyperparameters |
|
|
|
We used the same hyperparameters as in the original ByT5 paper. |
|
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.
|
|
|
### Computational Infrastructure |
|
|
|
Models were trained on TPUs available through TPU Research Cloud (TRC). |
|
We used a v3-8 TPU to train the small and base models and a v3-32 TPU for the large model.

Training each model took:
|
|
|
- **Small**: 90h |
|
- **Base**: 230h |
|
- **Large**: 190h |
|
|
|
# Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
MyT5 models are compared with a reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.
|
|
|
## Language Modeling |
|
|
|
We evaluated LM performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.

To compare scores across languages and models, we used a normalized metric: Bits-per-English-Byte (BPEB).
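
Concretely, BPEB divides a sentence's total negative log-likelihood (in bits) by the UTF-8 byte length of its parallel English sentence, so languages with byte-hungry scripts are not penalized for longer sequences. A minimal sketch of this normalization (the helper name and the NLL aggregation are illustrative, not taken from the released code):

```python
import math

def bits_per_english_byte(total_nll_nats: float, english_sentence: str) -> float:
    """Normalize a sentence-level NLL (in nats) by the UTF-8 byte length of the
    parallel English sentence, yielding Bits-per-English-Byte (BPEB)."""
    total_bits = total_nll_nats / math.log(2)
    return total_bits / len(english_sentence.encode("utf-8"))
```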
|
|
|
### Results |
|
|
|
| Size  | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|-----------|-------------|-----------|-------------|
| small | All       | 10.1      | 7.0         | 4.6       | 6.7         |
|       | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
|       | Non-Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base  | All       | 8.2       | 11.5        | 5.8       | 8.9         |
|       | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
|       | Non-Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large | All       | 13.4      | 31.8        | 4.6       | 26.7        |
|       | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
|       | Non-Latin | 18.2      | 37.3        | 5.4       | 27.0        |
|
|
|
Bits-per-English-Byte (BPEB) and inference times (average per FLORES 200 sentence), averaged over three language groupings.
|
The inference was run on an A40 GPU core. |
|
|
|
## Downstream Tasks |
|
|
|
We tested the large model on four downstream tasks: question answering, NER, semantic parsing, and machine translation.

The test data come from the XTREME-UP benchmark ([Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf)), which mainly covers low-resource languages.
|
|
|
### Fine-tuning |
|
|
|
In each task, we fine-tuned on all languages jointly.

We used a learning rate of 1e-3 with square-root decay and a dropout rate of 0.1; an illustrative version of this schedule is shown after the list below.

The batch size and the number of training steps varied across tasks:
|
|
|
- **NER**: 128 examples per batch, 6000 steps |
|
- **QA**: 64 examples per batch, 6500 steps |
|
- **Semantic Parsing**: 64 examples per batch, 1000 steps |
|
- **MT**: 64 examples per batch, 10000 steps |
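
Below is an illustrative version of that schedule written as a PyTorch `LambdaLR`; the warmup length and the optimizer choice are placeholders (the T5 family is typically trained with Adafactor), not the original training configuration.

```python
# Illustrative only: peak LR 1e-3 with inverse square-root decay after a
# placeholder warmup; `model` is the checkpoint loaded in the earlier snippet.
import torch

warmup_steps = 1000  # hypothetical placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: (warmup_steps / max(step, warmup_steps)) ** 0.5,
)
```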
|
|
|
|
|
### Results |
|
|
|
| Model      | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
|------------|---------|----------|-----------------------|-----------|
| Flan-PaLM* | 22.9    | 12.0     | 0.1                   | ---       |
| mT5*       | 59.7    | 74.0     | 21.8                  | ---       |
| ByT5       | 73.2    | 81.5     | 25.1                  | 20.1      |
| MyT5       | 75.3    | 80.8     | 19.6                  | 20.4      |

Inference times per example (ms):

| Model | QA   | NER  | Semantic Parsing | MT   |
|-------|------|------|------------------|------|
| ByT5  | 36.2 | 13.8 | 13.2             | 15.9 |
| MyT5  | 35.6 | 12.6 | 12.4             | 12.6 |
|
|
|
Average results on XTREME-UP tasks across low-resource languages.
|
The baseline results of mT5 and Flan-PaLM (in-context-learning evaluation) are reported in [Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf). |
|
The reported inference time is an average across evaluation examples; the inference was run on an A40 GPU core. |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{limisiewicz2024myte, |
|
title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling}, |
|
author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer}, |
|
year={2024}, |
|
eprint={2403.10691}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|
|
|
## Model Card Author |
|
|
|
[Tomasz Limisiewicz](mailto:[email protected])