seanghay/xlm-roberta-khmer-small

	---
	library_name: transformers
	license: apache-2.0
	language:
	- km
	pipeline_tag: fill-mask
	---

	# XLMRoBERTa for Khmer Language

	Trained from scratch with the masked language modeling (MLM) objective for 1M steps on 5M Khmer sentences (162M words, 578K unique words).

	The training data was created by crawling publicly available news sites and Wikipedia.


	## Why?

	1. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is big: it has 279M parameters, while this model has only 49M.
	2. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is not optimized for the Khmer language.
	3. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) has a much larger vocabulary (250,002 tokens), while this model uses a vocabulary of 8000 tokens.
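
The first and third points are related, since a token embedding table grows linearly with vocabulary size. The following is a rough back-of-the-envelope sketch, not a measurement: `embedding_params` is a hypothetical helper, and the hidden size of 768 is taken from the published xlm-roberta-base configuration rather than from this README. The hidden size of this model is not stated here, so the last line reuses 768 purely for illustration.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size token embedding table."""
    return vocab_size * hidden_size


# Assumption: xlm-roberta-base uses a hidden size of 768 (from its published config).
XLMR_BASE_VOCAB = 250_002
XLMR_BASE_HIDDEN = 768
XLMR_BASE_TOTAL = 279_000_000  # approximate total parameter count quoted above

xlmr_embed = embedding_params(XLMR_BASE_VOCAB, XLMR_BASE_HIDDEN)
xlmr_share = xlmr_embed / XLMR_BASE_TOTAL

print(f"xlm-roberta-base token embeddings: {xlmr_embed:,} weights")
print(f"share of total parameters: {xlmr_share:.0%}")

# Illustration only: an 8000-token vocabulary at the same (assumed) hidden size.
small_embed = embedding_params(8_000, XLMR_BASE_HIDDEN)
print(f"8000-token table at hidden size 768: {small_embed:,} weights")
```

Under these assumptions, most of the parameters in xlm-roberta-base sit in its multilingual token embeddings, which a Khmer-only model does not need.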

	## Usage


	```python
	from transformers import pipeline

	pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")

	result = pipe("សួស្ដីកម្ពុ<mask>!")
	print(result)
	```

	```python
	[
	{"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
	{"score": 0.17512884736061096, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
	{"score": 0.0034702506382018328, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
	{"score": 0.00305828545242548, "token": 16, "token_str": "រ", "sequence": "សួស្ដីកម្ពុរ!"},
	{"score": 0.0007526700501330197, "token": 133, "token_str": "គ", "sequence": "សួស្ដីកម្ពុគ!"},
	]
	```
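
The pipeline returns a list of dictionaries, one per candidate token. As a minimal sketch of how that output might be consumed, the snippet below picks the highest-scoring candidate and rejects it if the score is too low. `best_prediction` and its `min_score` threshold are hypothetical names introduced for illustration; the data is copied from the example output above so the snippet runs without downloading the model.

```python
# Top three predictions copied from the example output above.
result = [
    {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
    {"score": 0.17512884736061096, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
    {"score": 0.0034702506382018328, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
]


def best_prediction(predictions, min_score=0.5):
    """Return the highest-scoring prediction, or None if it scores below min_score."""
    top = max(predictions, key=lambda p: p["score"])
    return top if top["score"] >= min_score else None


best = best_prediction(result)
if best is not None:
    print(best["sequence"], round(best["score"], 3))
```

To get more or fewer candidates from the model itself, the fill-mask pipeline also accepts a `top_k` argument when called.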

	## License

	`Apache-2.0`

	## Citation

	No need. :)