ancatmara/historical-irish-tokenizer-wordpiece

	---
	license: cc-by-nc-sa-4.0
	language:
	- ga
	- sga
	- mga
	- ghc
	- la
	library_name: transformers
	---

	The Historical Irish WordPiece tokenizer was trained on Old, Middle, Early Modern, Classical Modern and pre-reform Modern Irish texts from the St. Gall Glosses, the Würzburg Glosses, CELT and the book subcorpus of the Historical Irish Corpus. The training data spans ca. 550 to 1926 and covers a wide variety of genres, such as bardic poetry, native Irish stories, translations and adaptations of continental epic and romance, annals, genealogies, grammatical and medical tracts, diaries, and religious writing. Because some texts code-switch between Irish and Latin, the vocabulary also contains some Latin.

	WordPiece is the subword tokenization algorithm used for [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert), [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert), and [ELECTRA](https://huggingface.co/docs/transformers/en/model_doc/electra). The algorithm was outlined in [Japanese and Korean Voice Search (Schuster and Nakajima, 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and closely resembles byte-pair encoding (BPE). WordPiece first initializes the vocabulary with every character present in the training data, then progressively learns a given number of merge rules. Unlike BPE, WordPiece does not merge the most frequent symbol pair; it merges the pair that maximizes the likelihood of the training data once added to the vocabulary.
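The difference between the two merge criteria can be illustrated on a toy corpus. The sketch below assumes the commonly described WordPiece scoring rule, `freq(pair) / (freq(first) * freq(second))`, as a stand-in for the likelihood criterion. The corpus, counts, and helper name are made up for illustration and do not reflect how this tokenizer was actually trained.

```python
from collections import Counter


def pair_statistics(word_counts):
    """Count single symbols and adjacent symbol pairs, weighted by word frequency."""
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, count in word_counts.items():
        for symbol in symbols:
            symbol_freq[symbol] += count
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += count
    return symbol_freq, pair_freq


# Hypothetical toy corpus: each word is split into characters and mapped to its count.
toy_corpus = {("i", "n"): 10, ("i", "n", "n", "a"): 4, ("d", "o"): 3}
symbol_freq, pair_freq = pair_statistics(toy_corpus)

# BPE-style choice: the most frequent adjacent pair.
bpe_choice = max(pair_freq, key=pair_freq.get)

# WordPiece-style choice: pair frequency normalized by the frequencies of its parts.
scores = {
    (a, b): freq / (symbol_freq[a] * symbol_freq[b])
    for (a, b), freq in pair_freq.items()
}
wordpiece_choice = max(scores, key=scores.get)

print(bpe_choice)        # ('i', 'n')
print(wordpiece_choice)  # ('d', 'o')
```

In this toy case BPE would merge `i` + `n` because the pair is the most frequent, while the normalized score prefers `d` + `o`, whose parts never occur apart.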

	### Use

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("ancatmara/historical-irish-tokenizer-wordpiece")
	texts = ['Boí Óengus in n-aidchi n-aili inna chotlud.', 'Co n-accae ní, in n-ingin cucci for crunn síuil dó.']

	tokenizer(texts, max_length=128, truncation=True)
	```

	Out:

	```python
	{'input_ids': [[0, 905, 2526, 158, 55, 18, 2561, 55, 18, 2259, 1676, 10924, 19, 2], [0, 154, 55, 18, 4457, 106, 207, 17, 158, 55, 18, 2139, 11166, 98, 222, 7499, 20032, 148, 19, 2]],
	 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
	 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
	```
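The two encoded sequences have different lengths (14 and 20 ids), so they cannot be stacked into one rectangular batch as they are. In practice, passing `padding=True` to the tokenizer call takes care of this. The pure-Python sketch below only illustrates what right-padding does to `input_ids` and `attention_mask`. `PAD_ID` is a hypothetical placeholder, not a value taken from this tokenizer; the real id should be read from `tokenizer.pad_token_id`.

```python
# Token ids copied from the output above.
batch = [
    [0, 905, 2526, 158, 55, 18, 2561, 55, 18, 2259, 1676, 10924, 19, 2],
    [0, 154, 55, 18, 4457, 106, 207, 17, 158, 55, 18, 2139, 11166, 98, 222,
     7499, 20032, 148, 19, 2],
]

PAD_ID = 1  # hypothetical placeholder; use tokenizer.pad_token_id in real code


def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch(batch, PAD_ID)
print([len(ids) for ids in padded["input_ids"]])  # [20, 20]
print(sum(padded["attention_mask"][0]))           # 14
```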

	```python
	tokenizer.decode([0, 905, 2526, 158, 55, 18, 2561, 55, 18, 2259, 1676, 10924, 19, 2])
	```

	Out:

	```python
	'<s> boi oengus in n - aidchi n - aili inna chotlud. </s>'
	```
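The decoded string is lowercased and has lost its length marks (`Boí Óengus` becomes `boi oengus`), which suggests that the tokenizer normalizes its input before splitting it. The sketch below approximates that behaviour with the standard library. It is an inference from the output above, not the tokenizer's documented normalizer, and the function name is made up for illustration.

```python
import unicodedata


def lowercase_and_strip_accents(text):
    """Lowercase the text and drop combining marks such as the acute accent."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


normalized = lowercase_and_strip_accents("Boí Óengus in n-aidchi n-aili inna chotlud.")
print(normalized)  # boi oengus in n-aidchi n-aili inna chotlud.
```

The decoded output additionally shows spaces around the hyphens (`n - aidchi`), because the hyphen is a separate token and decoding joins tokens with spaces. Decoding therefore does not reproduce the original text exactly.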