---
license: mit
datasets:
- oscar
- mc4
language:
- am
library_name: transformers
---

# Amharic WordPiece Tokenizer
This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It uses the same tokenization algorithm as the **BERT** tokenizer, but was trained from scratch on Amharic text with a vocabulary size of `30522`.
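As a rough illustration of how a tokenizer like this can be produced, the sketch below retrains BERT's WordPiece tokenizer on Amharic text with `train_new_from_iterator`. The OSCAR config name `unshuffled_deduplicated_am` and the output directory are assumptions for illustration, not details taken from this repo's actual training setup.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Amharic portion of OSCAR; the config name is an assumption for illustration
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

# Reuse BERT's tokenizer pipeline (normalization, pre-tokenization, WordPiece)
# but learn a brand-new vocabulary from the Amharic corpus
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
amharic_tokenizer = base_tokenizer.train_new_from_iterator(
    (example["text"] for example in dataset),
    vocab_size=30522,
)

amharic_tokenizer.save_pretrained("bert-amharic-tokenizer")
```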
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")

# Split an Amharic sentence into WordPiece subword tokens
tokenizer.tokenize("α¨αααα αα αα» ααα΅ αα΅ααα΅ α΅α αα΅α αααΈαα α αα°α¨αα α΅αα α αα± α αα αα£αͺα« ααα αα»α α₯α α¨αααααα΅ αα³α ααα’")
```
Output:
```python
['α¨ααα', '##α αα', 'αα»', 'ααα΅', 'αα΅ααα΅', 'α΅α αα΅α', 'αααΈαα', 'α αα°α¨αα', 'α΅αα', 'α αα±', 'α αα', 'αα£αͺα«', 'ααα', 'αα»α', 'α₯α', 'α¨αααααα΅', 'αα³α', 'αα', 'α’']
```
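Beyond inspecting tokens, the tokenizer can be called directly to produce model-ready input IDs. The snippet below is a generic `transformers` usage sketch; the string assigned to `text` is just a placeholder for any Amharic input.
```python
# Encode a sentence into input IDs (with [CLS]/[SEP] added), then decode it back
text = "ሰላም"  # placeholder Amharic input ("hello")
encoding = tokenizer(text)

print(encoding["input_ids"])                                   # token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # subword tokens
print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))
```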