[SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary. It helps process languages that don't separate words. All transformer models in the `transformers` library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are [ALBERT](https://huggingface.co/docs/transformers/en/model_doc/albert), [XLNet](https://huggingface.co/docs/transformers/en/model_doc/xlnet), [Marian](https://huggingface.co/docs/transformers/en/model_doc/marian), and [T5](https://huggingface.co/docs/transformers/en/model_doc/t5).
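
The raw-stream idea can be illustrated with a toy, standard-library-only BPE sketch (not the actual SentencePiece implementation): spaces are rewritten to the visible `▁` symbol and then merged like any other character.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE over a raw character stream, SentencePiece-style:
    spaces become the '▁' symbol and participate in merges."""
    symbols = list(text.replace(" ", "▁"))
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the current segmentation.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the best pair with the merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, symbols = train_bpe("low low low", 2)
print(merges)   # learned merge rules
print(symbols)  # segmentation after the merges, '▁' included
```

Because the space survives as `▁` in the vocabulary, detokenization is a lossless string join, which is what makes the scheme work for languages that don't separate words.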
This tokenizer was trained with `vocab_size=25000` and `min_frequency=2`.
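
A training call of that shape can be sketched with the Hugging Face `tokenizers` library, whose trainers take `vocab_size` and `min_frequency` arguments; the library choice and the tiny corpus below are assumptions for illustration, not the actual training setup.

```python
from tokenizers import SentencePieceBPETokenizer

# Stand-in corpus: the real training data is not stated in this card.
corpus = [
    "a tiny stand-in corpus",
    "a tiny stand-in corpus repeated so pairs reach min_frequency",
]

tokenizer = SentencePieceBPETokenizer()
# min_frequency=2 drops merges whose pair occurs fewer than twice;
# vocab_size=25000 is the upper bound on the learned vocabulary.
tokenizer.train_from_iterator(corpus, vocab_size=25000, min_frequency=2)

print(tokenizer.encode("a tiny test").tokens)
```

On a corpus this small the trainer stops far short of 25,000 entries; the cap only binds on realistic amounts of text.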
### Use
```python
# Minimal usage sketch; "user/tokenizer-repo" is a placeholder for this
# repository's Hub id, which is not stated above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("user/tokenizer-repo")
print(tokenizer.tokenize("Hello world"))
```