language:
  - it
datasets:
  - gsarti/clean_mc4_it
tags:
  - seq2seq
  - lm-head
license: apache-2.0
inference: false
thumbnail: https://gsarti.com/publication/it5/featured.png

Italian T5 Small 🇮🇹

The IT5 model family represents the first effort in pretraining large-scale sequence-to-sequence transformer models for the Italian language, following the approach adopted by the original T5 model.

This model is released as part of the project "IT5: Text-to-Text Pretraining for Italian Language Understanding and Generation", by Gabriele Sarti and Malvina Nissim, with the support of Hugging Face and with TPU usage sponsored by Google's TPU Research Cloud. All training was conducted on a single TPU v3-8 VM on Google Cloud. Refer to the TensorBoard tab of the repository for an overview of the training process.

The inference widget is deactivated because the model needs seq2seq fine-tuning on a specific downstream task to be useful in practice. The models in the it5 organization provide some examples of this model fine-tuned on various downstream tasks.
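To illustrate why a pretrained-only checkpoint is not directly usable, the sketch below shows the kind of input/target pair used in T5-style span corruption. It assumes IT5 follows the same objective as the original T5; the function name `span_corrupt`, the word-level tokens, and the hand-picked spans are illustrative simplifications (real pretraining works on SentencePiece tokens with randomly sampled spans).

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Mask each span with a sentinel token and build the matching target.

    `spans` must be sorted, non-overlapping, half-open (start, end) index ranges.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # the target announces the span...
        targets.extend(tokens[start:end])    # ...then spells out its content
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Il gatto dorme sul divano in salotto".split()
corrupted, target = span_corrupt(tokens, [(1, 2), (4, 6)])
print(corrupted)  # Il <extra_id_0> dorme sul <extra_id_1> salotto
print(target)     # <extra_id_0> gatto <extra_id_1> divano in <extra_id_2>
```

A model trained only on pairs like these learns to fill in masked spans, which is why it needs further fine-tuning before it can summarize, translate, or answer questions.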

Model variants

This repository contains the checkpoints for the small version of the model. The model was trained for one epoch (1.05M steps) on the Thoroughly Cleaned Italian mC4 Corpus (~41B words, ~275GB) using 🤗 Datasets and the google/t5-v1_1-small improved configuration. The training procedure is made available on GitHub.

The following table summarizes the parameters for all available models:

| | it5-small (this one) | it5-base | it5-large | it5-base-oscar |
|---|---|---|---|---|
| dataset | gsarti/clean_mc4_it | gsarti/clean_mc4_it | gsarti/clean_mc4_it | oscar/unshuffled_deduplicated_it |
| architecture | google/t5-v1_1-small | google/t5-v1_1-base | google/t5-v1_1-large | t5-base |
| learning rate | 5e-3 | 5e-3 | 5e-3 | 1e-2 |
| steps | 1,050,000 | 1,050,000 | 2,100,000 | 258,000 |
| training time | 36 hours | 101 hours | 370 hours | 98 hours |
| ff projection | gated-gelu | gated-gelu | gated-gelu | relu |
| tie embeds | false | false | false | true |
| optimizer | adafactor | adafactor | adafactor | adafactor |
| max seq. length | 512 | 512 | 512 | 512 |
| per-device batch size | 16 | 16 | 8 | 16 |
| tot. batch size | 128 | 128 | 64 | 128 |
| weight decay | 1e-3 | 1e-3 | 1e-2 | 1e-3 |
| validation split size | 15K examples | 15K examples | 15K examples | 15K examples |

The long training time of it5-base-oscar relative to its number of steps was due to a bug in the training script.
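The batch-size and step figures in the table can be related with some simple arithmetic. The sketch below assumes the 8 cores of the single TPU machine each count as one device (`NUM_DEVICES` is that assumption, not a value stated in the table); `RUNS` and `summarize` are illustrative names.

```python
# Values copied from the table; NUM_DEVICES is an assumption
# (the 8 cores of a TPU v3-8, each treated as one device).
NUM_DEVICES = 8

RUNS = {
    "it5-small":      {"per_device_batch": 16, "steps": 1_050_000, "hours": 36},
    "it5-base":       {"per_device_batch": 16, "steps": 1_050_000, "hours": 101},
    "it5-large":      {"per_device_batch": 8,  "steps": 2_100_000, "hours": 370},
    "it5-base-oscar": {"per_device_batch": 16, "steps": 258_000,   "hours": 98},
}


def summarize(run: dict) -> dict:
    """Derive total batch size, sequences seen, and throughput for one run."""
    total_batch = run["per_device_batch"] * NUM_DEVICES
    return {
        "total_batch": total_batch,
        "sequences_seen": total_batch * run["steps"],
        "steps_per_hour": round(run["steps"] / run["hours"]),
    }


for name, run in RUNS.items():
    print(name, summarize(run))
```

Under this assumption the derived totals match the "tot. batch size" row, and it5-large, with half the batch size and twice the steps, sees the same number of sequences as it5-small and it5-base.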

For a list of individual model parameters, refer to the config.json file in the respective repositories.

Using the models

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-small")

Note: You will need to fine-tune the model on your downstream seq2seq task to use it. See an example here.
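As a rough sketch of what preparing data for such fine-tuning might look like, the helpers below cast raw records into text-to-text pairs and tokenize them. Everything named here is hypothetical: the `TASK_PREFIX`, the record keys, and the helper names are illustrative, and IT5 does not prescribe a prefix. `encode_batch` assumes a recent version of transformers in which the tokenizer call accepts `text_target`.

```python
from typing import Dict, List

# Hypothetical task prefix, chosen only for illustration.
TASK_PREFIX = "riassumi: "


def to_text_pairs(records: List[Dict[str, str]], source_key: str, target_key: str,
                  prefix: str = TASK_PREFIX) -> List[Dict[str, str]]:
    """Cast raw records into (input text, target text) pairs."""
    return [
        {"input_text": prefix + r[source_key].strip(),
         "target_text": r[target_key].strip()}
        for r in records
    ]


def encode_batch(tokenizer, pairs: List[Dict[str, str]], max_length: int = 512):
    """Tokenize pairs; padded label positions become -100 so the loss skips them."""
    enc = tokenizer(
        [p["input_text"] for p in pairs],
        text_target=[p["target_text"] for p in pairs],
        max_length=max_length,
        truncation=True,
        padding=True,
    )
    pad_id = tokenizer.pad_token_id
    enc["labels"] = [
        [tok if tok != pad_id else -100 for tok in seq] for seq in enc["labels"]
    ]
    return enc


records = [{"text": " Un articolo molto lungo. ", "summary": "Un riassunto."}]
print(to_text_pairs(records, "text", "summary"))
```

The resulting encodings (`input_ids`, `attention_mask`, `labels`) are in the form that seq2seq models in transformers expect; passing them to the tokenizer and model loaded above would be the next step in a training loop.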

Flax and TensorFlow versions of the model are also available:

from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-small")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-small")

Limitations

Due to the nature of the web-scraped corpus on which IT5 models were trained, it is likely that their usage could reproduce and amplify pre-existing biases in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. For this reason, the study of such biases is explicitly encouraged, and model usage should ideally be restricted to research-oriented and non-user-facing endeavors.

Model curators

For problems or updates on this model, please contact [email protected].

Citation Information

@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
}