---
license: cc-by-nc-4.0
language:
- bo
base_model: google-t5/t5-small
tags:
- nlp
- transliteration
- tibetan
- buddhism
datasets:
- billingsmoore/tibetan-phonetic-transliteration-dataset
---

# Model Card for tibetan-phonetic-transliteration

This model is a text2text generation model for phonetic transliteration of Tibetan script.

## Model Details

### Model Description

- **Developed by:** billingsmoore
- **Model type:** text2text generation
- **Language(s) (NLP):** Tibetan
- **License:** [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** ['google-t5/t5-small'](https://huggingface.co/google-t5/t5-small)

### Model Sources

- **Repository:** [https://github.com/billingsmoore/MLotsawa](https://github.com/billingsmoore/MLotsawa)

## Uses

The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem.

### Direct Use

To use the model for transliteration in a Python script, you can use the transformers library like so:

```python
from transformers import pipeline

transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')

# Example input only; replace with the Unicode Tibetan text you want to transliterate
tibetan_text = 'སངས་རྒྱས་'
transliterated_text = transliterator(tibetan_text)
```

### Downstream Use

The model can be finetuned for a specific use case using the following code.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor
)
from accelerate import Accelerator

# Placeholder: replace with the name or path of your own dataset.
# It needs a 'bo' column (Tibetan) and a 'phon' column (phonetic transliteration).
dataset = load_dataset('your-username/your-dataset')
dataset = dataset['train'].train_test_split(test_size=0.1)

checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

# The collator takes the model object itself, not the checkpoint name
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

source_lang = 'bo'
target_lang = 'phon'

def preprocess_function(examples):
    # Tokenize the Tibetan inputs and the phonetic targets together
    inputs = examples[source_lang]
    targets = examples[target_lang]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=256,
                             truncation=True, padding="max_length")
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Adafactor with a fixed learning rate
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    num_train_epochs=5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),  # no custom learning-rate scheduler
    data_collator=data_collator
)

trainer.train()
```

## Bias, Risks, and Limitations

This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan. It may not perform satisfactorily on texts from other corpora or on other dialects of Tibetan.

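If you want a rough sense of how well the model handles your own material, one option is to compare its output against a handful of transliterations you trust. The sketch below computes a character error rate (CER) in plain Python. The choice of CER and the function names `edit_distance` and `character_error_rate` are my assumptions for illustration; this is not a metric reported for this model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by total reference length (lower is better)."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical usage, assuming `transliterator` is the pipeline from "Direct Use",
# `tibetan_lines` is a list of your own Tibetan strings, and `references` holds
# transliterations you consider correct:
#
#   predictions = [out['translation_text'] for out in transliterator(tibetan_lines)]
#   print(character_error_rate(predictions, references))
```

A score of 0.0 means the output matched the references exactly. How high a score is acceptable depends on your use case, so treat the number as a comparison point rather than a pass/fail threshold.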
### Recommendations

For users who wish to use the model for other texts, I recommend further finetuning on your own dataset using the instructions above.

## Training Details

This model was trained on 98,597 pairs of text. The first member of each pair is a line of Unicode Tibetan text; the second (the target) is the phonetic transliteration of the first.

This dataset was scraped from Lotsawa House and is released on Kaggle under the same license as the texts from which it is sourced.

[You can find this dataset and more information on Kaggle by clicking here.](https://www.kaggle.com/datasets/billingsmoore/tibetan-phonetic-transliteration-pairs)

[You can find this dataset and more information on Hugging Face by clicking here.](https://huggingface.co/datasets/billingsmoore/tibetan-phonetic-transliteration-dataset)

This model was trained for five epochs. Further information regarding training can be found in the documentation of the [MLotsawa repository](https://github.com/billingsmoore/MLotsawa).

## Model Card Contact

billingsmoore [at] gmail [dot] com