---
library_name: transformers
license: apache-2.0
language:
- km
pipeline_tag: fill-mask
---

# XLM-RoBERTa for Khmer Language

Trained from scratch with the **Masked Language Modeling** objective for 1M steps on 5M Khmer sentences (162M words; 578K unique words). The training data was created by crawling publicly available news sites and Wikipedia.

## Why?

1. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is big (279M parameters), while this model has only 49M parameters.
2. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) is not optimized for the Khmer language.
3. [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) has a much larger vocabulary (250,002 tokens), while this model uses a vocabulary of only 8,000 tokens.

## Usage

```python
from transformers import pipeline

pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")
result = pipe("សួស្ដីកម្ពុ<mask>!")
print(result)
```

```python
[
  {"score": 0.8130345344543457, "token": 11, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុជា!"},
  {"score": 0.17512884736061096, "token": 160, "token_str": "ជ", "sequence": "សួស្ដីកម្ពុជ!"},
  {"score": 0.0034702506382018328, "token": 143, "token_str": "ជា", "sequence": "សួស្ដីកម្ពុ ជា!"},
  {"score": 0.00305828545242548, "token": 16, "token_str": "រ", "sequence": "សួស្ដីកម្ពុរ!"},
  {"score": 0.0007526700501330197, "token": 133, "token_str": "គ", "sequence": "សួស្ដីកម្ពុគ!"},
]
```

## License

`Apache-2.0`

## Citation

No need. :)
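
## Usage without `pipeline`

If you need the raw logits (for example, to apply your own decoding), the model can also be loaded without the `pipeline` helper. This is a minimal sketch using the generic `AutoTokenizer`/`AutoModelForMaskedLM` classes; the top-5 decoding loop below is illustrative and not part of the original card.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "seanghay/xlm-roberta-khmer-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Build an input using the tokenizer's own mask token.
text = f"សួស្ដីកម្ពុ{tokenizer.mask_token}!"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take the top-5 candidate tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
scores, token_ids = probs.topk(5)

for score, token_id in zip(scores[0], token_ids[0]):
    print(tokenizer.decode(token_id.item()), float(score))
```

This should produce the same top-5 candidates as the pipeline call above.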