---
license: mit
datasets:
- oscar
- mc4
language:
- am
library_name: transformers
---
# Amharic WordPiece Tokenizer
This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It uses the same algorithm as the **BERT** tokenizer but was trained from scratch on Amharic text, with a vocabulary size of `30522`.
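The training script is not part of this repo, but a tokenizer like this can be reproduced with the `train_new_from_iterator` API in `transformers`. The sketch below is illustrative only; it assumes the `unshuffled_deduplicated_am` OSCAR config and `bert-base-uncased` as the base tokenizer, which may differ from what was actually used.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative sketch, not this repo's actual training script.
# "unshuffled_deduplicated_am" is the Amharic subset of OSCAR (assumed config name).
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Stream the corpus as batches of raw text
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from BERT's WordPiece configuration and retrain the vocabulary from scratch
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
amharic_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=30522
)
amharic_tokenizer.save_pretrained("bert-amharic-tokenizer")
```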
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α¨αααα αα αα» ααα΅ αα΅ααα΅ α΅ααα΅α αααΈαα α αα°α¨αα α΅αα α αα± α αα αα£αͺα« ααα αα»α α₯α α¨αααααα΅ αα³α ααα’")
```
Output:
```python
['α¨ααα', '##α αα', 'αα»', 'ααα΅', 'αα΅ααα΅', 'α΅ααα΅α', 'αααΈαα', 'α αα°α¨αα', 'α΅αα', 'α αα±', 'α αα', 'αα£αͺα«', 'ααα', 'αα»α', 'α₯α', 'α¨αααααα΅', 'αα³α', 'αα', 'α’']
```
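Beyond `tokenize`, the full `transformers` tokenizer API is available. For example, encoding text into model-ready input IDs (a minimal sketch; `text` stands in for any Amharic string):

```python
text = "..."  # any Amharic sentence
encoding = tokenizer(text, return_tensors="pt")
print(encoding["input_ids"])  # token IDs, with [CLS] and [SEP] added automatically
print(tokenizer.decode(encoding["input_ids"][0]))  # round-trip back to text
```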