---
library_name: transformers
license: apache-2.0
datasets:
  - Fece228/latin-literature-dataset-170M
language:
  - la
---

Pretrained from scratch using the GPT-2 architecture on a dataset of Latin texts (Corpus Corporum). The model has a 64-token context. Training ran for 1 epoch over 492 million tokens, with a loss of 4.5. The GPT-2 style tokenizer was trained with a `min_frequency` of 2000.
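To illustrate the tokenizer setting, here is a minimal sketch of how a GPT-2 style byte-level BPE tokenizer could be trained with a `min_frequency` of 2000 using the Hugging Face `tokenizers` library. Only `min_frequency` comes from this card. The vocabulary size, the `<|endoftext|>` special token, and the helper names `tokenizer_training_kwargs` and `train_latin_tokenizer` are assumptions made for illustration, not the settings actually used here.

```python
from typing import Any, Dict, Iterable


def tokenizer_training_kwargs(min_frequency: int = 2000) -> Dict[str, Any]:
    """Collect the training arguments in one place.

    min_frequency=2000 is the value stated in this card: a pair of symbols
    must occur at least that many times before it can be merged into a token.
    vocab_size and special_tokens are illustrative guesses (GPT-2 defaults).
    """
    return {
        "vocab_size": 50257,                  # assumption, not stated in the card
        "min_frequency": min_frequency,       # from the card
        "special_tokens": ["<|endoftext|>"],  # assumption, not stated in the card
    }


def train_latin_tokenizer(texts: Iterable[str]):
    """Train a byte-level BPE tokenizer on an iterable of Latin strings."""
    # Imported lazily so the module can be loaded without `tokenizers` installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(texts, **tokenizer_training_kwargs())
    return tokenizer
```

A high `min_frequency` keeps rare merges out of the vocabulary, so uncommon word forms are split into more, shorter pieces. With a 64-token context, that trade-off directly affects how much text fits in one window.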

Output tends to get repetitive and is not very coherent, owing to the small model size and limited training data.
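The following is a rough sketch of how one might sample from the model while working around the repetition, assuming the checkpoint and its tokenizer load from the repository id `gaodrew/cicero` through the standard `transformers` auto classes. The sampling values (`top_p`, `repetition_penalty`, `no_repeat_ngram_size`) are illustrative starting points rather than tuned or recommended settings, and the helper names are hypothetical.

```python
CONTEXT_LENGTH = 64  # context size stated in this card


def remaining_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Number of new tokens that still fit in the context after the prompt."""
    return max(0, context_length - prompt_tokens)


def generate_latin(prompt: str, model_id: str = "gaodrew/cicero") -> str:
    """Sample a continuation, keeping prompt plus output within 64 tokens."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    budget = remaining_budget(inputs["input_ids"].shape[1])
    if budget == 0:
        raise ValueError("Prompt already fills the 64-token context.")

    output_ids = model.generate(
        **inputs,
        max_new_tokens=budget,
        do_sample=True,
        top_p=0.95,               # illustrative value
        repetition_penalty=1.3,   # illustrative value; discourages repeated tokens
        no_repeat_ngram_size=3,   # illustrative value; blocks repeated trigrams
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_latin("Gallia est omnis divisa")` would return the prompt followed by a sampled continuation. Decoding penalties can reduce loops, but they do not address the underlying coherence limits.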