---
license: mit
datasets:
- oscar
- mc4
language:
- am
library_name: transformers
---

# Amharic WordPiece Tokenizer
This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It uses the same tokenization algorithm as the **BERT** tokenizer, but was trained from scratch on Amharic text with a vocabulary size of `30522`.
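As a rough illustration of how a tokenizer like this can be produced, the sketch below retrains BERT's WordPiece tokenizer on Amharic text with `train_new_from_iterator`. The OSCAR config name `unshuffled_deduplicated_am` and the output directory are assumptions for illustration, not details taken from this repo's actual training setup.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Amharic portion of OSCAR; the config name is an assumption for illustration
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

# Reuse BERT's tokenizer pipeline (normalization, pre-tokenization, WordPiece)
# but learn a brand-new vocabulary from the Amharic corpus
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
amharic_tokenizer = base_tokenizer.train_new_from_iterator(
    (example["text"] for example in dataset),
    vocab_size=30522,
)

amharic_tokenizer.save_pretrained("bert-amharic-tokenizer")
```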
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")

# Split an Amharic sentence into WordPiece subword tokens
tokenizer.tokenize("α¨αααα αα αα» ααα΅ αα΅ααα΅ α΅α αα΅α αααΈαα α αα°α¨αα α΅αα α αα± α αα αα£αͺα« ααα αα»α α₯α α¨αααααα΅ αα³α ααα’")
```
Output:
```python
['α¨ααα', '##α αα', 'αα»', 'ααα΅', 'αα΅ααα΅', 'α΅α αα΅α', 'αααΈαα', 'α αα°α¨αα', 'α΅αα', 'α αα±', 'α αα', 'αα£αͺα«', 'ααα', 'αα»α', 'α₯α', 'α¨αααααα΅', 'αα³α', 'αα', 'α’']
```
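Beyond inspecting tokens, the tokenizer can be called directly to produce model-ready input IDs. The snippet below is a generic `transformers` usage sketch; the string assigned to `text` is just a placeholder for any Amharic input.
```python
# Encode a sentence into input IDs (with [CLS]/[SEP] added), then decode it back
text = "ሰላም"  # placeholder Amharic input ("hello")
encoding = tokenizer(text)

print(encoding["input_ids"])                                   # token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # subword tokens
print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))
```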