--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- km |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# XLM-RoBERTa for Khmer Language
|
|
|
This model was trained from scratch with the **masked language modeling** objective on 5M Khmer sentences (162M words, 578K unique words) for 1M steps.
|
|
|
The training data was created by crawling publicly available news sites and Wikipedia.
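The pretraining setup looks roughly like the following minimal sketch (illustrative only: the configuration sizes and masking hyperparameters below are placeholder assumptions, not the values used to train this model):

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    XLMRobertaConfig,
    XLMRobertaForMaskedLM,
)

tokenizer = AutoTokenizer.from_pretrained("seanghay/xlm-roberta-khmer-small")

# An XLM-RoBERTa model initialized from scratch with an 8,000-token vocabulary.
# The width/depth settings are placeholders, not this model's actual config.
config = XLMRobertaConfig(
    vocab_size=8000,
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1536,
)
model = XLMRobertaForMaskedLM(config)

# Masked language modeling: the collator randomly masks 15% of input tokens,
# and the model learns to predict the original tokens at the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
```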
|
|
|
|
|
## Why? |
|
|
|
1. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is big: 279M parameters, while this model has only 49M.
|
2. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is not optimized for the Khmer language.
|
3. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) has a much larger vocabulary (250,002 tokens), while this model uses a vocabulary of only 8,000 (both figures can be checked with the sketch below).
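If you want to verify these numbers yourself, here is a quick sketch (assuming the checkpoint loads with the standard `AutoModelForMaskedLM` and `AutoTokenizer` classes):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "seanghay/xlm-roberta-khmer-small"
model = AutoModelForMaskedLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Count trainable parameters (expected: ~49M) and report the vocab size (8,000).
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"vocab size: {tokenizer.vocab_size:,}")
```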
|
|
|
## Usage |
|
|
|
|
|
```python
from transformers import pipeline

pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")

result = pipe("សួស្តីកម្ពុ<mask>!")  # "Hello Cambo<mask>!"
print(result)
```
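The top prediction fills in "ជា", completing the sentence as "សួស្តីកម្ពុជា!" ("Hello Cambodia!"):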
|
|
|
```python
[
  {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្តីកម្ពុជា!"},
  {"score": 0.17512884736061096, "token": 160, ...},
  {"score": 0.0034702506382018328, "token": 143, ...},
  {"score": 0.00305828545242548, "token": 16, ...},
  {"score": 0.0007526700501330197, "token": 133, ...},
]
```
|
|
|
## License |
|
|
|
`Apache-2.0` |
|
|
|
## Citation |
|
|
|
No need. :) |
|
|