---
license: mit
datasets:
- oscar
- mc4
language:
- am
library_name: transformers
---
# Amharic WordPiece Tokenizer
This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It uses the same algorithm as the **BERT** tokenizer but was trained from scratch on Amharic text, with a vocabulary size of `30522`.
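The training script is not part of this repo, but a tokenizer like this can be reproduced with the `train_new_from_iterator` API in `transformers`. The sketch below is illustrative only; it assumes the `unshuffled_deduplicated_am` OSCAR config and `bert-base-uncased` as the base tokenizer, which may differ from what was actually used.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative sketch, not this repo's actual training script.
# "unshuffled_deduplicated_am" is the Amharic subset of OSCAR (assumed config name).
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Stream the corpus as batches of raw text
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from BERT's WordPiece configuration and retrain the vocabulary from scratch
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
amharic_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=30522
)
amharic_tokenizer.save_pretrained("bert-amharic-tokenizer")
```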
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α¨αααα αα αα» ααα΅ αα΅ααα΅ α΅ααα΅α αααΈαα α αα°α¨αα α΅αα α αα± α αα αα£αͺα« ααα αα»α α₯α α¨αααααα΅ αα³α ααα’")
```
Output:
```python
['α¨ααα', '##α αα', 'αα»', 'ααα΅', 'αα΅ααα΅', 'α΅ααα΅α', 'αααΈαα', 'α αα°α¨αα', 'α΅αα', 'α αα±', 'α αα', 'αα£αͺα«', 'ααα', 'αα»α', 'α₯α', 'α¨αααααα΅', 'αα³α', 'αα', 'α’']
```
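Beyond `tokenize`, the full `transformers` tokenizer API is available. For example, encoding text into model-ready input IDs (a minimal sketch; `text` stands in for any Amharic string):

```python
text = "..."  # any Amharic sentence
encoding = tokenizer(text, return_tensors="pt")
print(encoding["input_ids"])  # token IDs, with [CLS] and [SEP] added automatically
print(tokenizer.decode(encoding["input_ids"][0]))  # round-trip back to text
```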