[SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary. It helps process languages that don't separate words. All transformer models in the `transformers` library that use SentencePiece use it in combination with unigram. Examples of models using SentencePiece are [ALBERT](https://huggingface.co/docs/transformers/en/model_doc/albert), [XLNet](https://huggingface.co/docs/transformers/en/model_doc/xlnet), [Marian](https://huggingface.co/docs/transformers/en/model_doc/marian), and [T5](https://huggingface.co/docs/transformers/en/model_doc/t5).
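
The raw-stream idea can be illustrated with a toy, standard-library-only BPE sketch (not the actual SentencePiece implementation): spaces are rewritten to the visible `▁` symbol and then merged like any other character.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE over a raw character stream, SentencePiece-style:
    spaces become the '▁' symbol and participate in merges."""
    symbols = list(text.replace(" ", "▁"))
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the current segmentation.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the best pair with the merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, symbols = train_bpe("low low low", 2)
print(merges)   # learned merge rules
print(symbols)  # segmentation after the merges, '▁' included
```

Because the space survives as `▁` in the vocabulary, detokenization is a lossless string join, which is what makes the scheme work for languages that don't separate words.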
This tokenizer was trained with `vocab_size=25000` and `min_frequency=2`.
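
A training call of that shape can be sketched with the Hugging Face `tokenizers` library, whose trainers take `vocab_size` and `min_frequency` arguments; the library choice and the tiny corpus below are assumptions for illustration, not the actual training setup.

```python
from tokenizers import SentencePieceBPETokenizer

# Stand-in corpus: the real training data is not stated in this card.
corpus = [
    "a tiny stand-in corpus",
    "a tiny stand-in corpus repeated so pairs reach min_frequency",
]

tokenizer = SentencePieceBPETokenizer()
# min_frequency=2 drops merges whose pair occurs fewer than twice;
# vocab_size=25000 is the upper bound on the learned vocabulary.
tokenizer.train_from_iterator(corpus, vocab_size=25000, min_frequency=2)

print(tokenizer.encode("a tiny test").tokens)
```

On a corpus this small the trainer stops far short of 25,000 entries; the cap only binds on realistic amounts of text.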
### Use
```python
# Minimal usage sketch; "user/tokenizer-repo" is a placeholder for this
# repository's Hub id, which is not stated above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("user/tokenizer-repo")
print(tokenizer.tokenize("Hello world"))
```