w11wo/sundanese-roberta-base


w11wo committed on Jul 17, 2021

Commit f875111 (1 parent: bad35c5)

Create README.md

Files changed (1)

README.md +75 -0

README.md ADDED

	@@ -0,0 +1,75 @@

+---
+language: su
+tags:
+  - sundanese-roberta-base
+license: mit
+datasets:
+  - mc4
+  - cc100
+  - oscar
+  - wikipedia
+widget:
+  - text: "Budi nuju <mask> di sakola."
+---
+## Sundanese RoBERTa Base
+Sundanese RoBERTa Base is a masked language model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) model. It was trained on four datasets: [OSCAR](https://hf.co/datasets/oscar)'s `unshuffled_deduplicated_su` subset, the Sundanese [mC4](https://hf.co/datasets/mc4) subset, the Sundanese [CC100](https://hf.co/datasets/cc100) subset, and Sundanese [Wikipedia](https://su.wikipedia.org/).
+10% of the combined dataset was held out for evaluation. The model was trained from scratch and achieved an evaluation loss of 1.952 and an evaluation accuracy of 63.98%.
+This model was trained using Hugging Face's Flax framework. All scripts used for training can be found in the [Files and versions](https://hf.co/w11wo/sundanese-roberta-base/tree/main) tab, along with the [Training metrics](https://hf.co/w11wo/sundanese-roberta-base/tensorboard) logged via TensorBoard.
+## Model
+| Model                    | #params | Arch.   | Training/Validation data (text)       |
+| ------------------------ | ------- | ------- | ------------------------------------- |
+| `sundanese-roberta-base` | 124M    | RoBERTa | OSCAR, mC4, CC100, Wikipedia (758 MB) |
+## Evaluation Results
+The model was trained for 50 epochs. The table below shows the final results at the end of training.
+| train loss | valid loss | valid accuracy | total time |
+| ---------- | ---------- | -------------- | ---------- |
+| 1.965      | 1.952      | 0.6398         | 6:24:51    |
+## How to Use
+### As Masked Language Model
+```python
+from transformers import pipeline
+pretrained_name = "w11wo/sundanese-roberta-base"
+fill_mask = pipeline(
+    "fill-mask",
+    model=pretrained_name,
+    tokenizer=pretrained_name
+)
+fill_mask("Budi nuju <mask> di sakola.")
+```
+### Feature Extraction in PyTorch
+```python
+from transformers import RobertaModel, RobertaTokenizerFast
+pretrained_name = "w11wo/sundanese-roberta-base"
+model = RobertaModel.from_pretrained(pretrained_name)
+tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)
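+prompt = "Budi nuju diajar di sakola."
+# Tokenize the prompt into PyTorch tensors, then run the encoder.
+encoded_input = tokenizer(prompt, return_tensors='pt')
+# output.last_hidden_state holds one hidden vector per input token.
+output = model(**encoded_input)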
+```
+## Disclaimer
+Keep in mind that biases present in any of the four datasets may carry over into the outputs of this model.
+## Author
+Sundanese RoBERTa Base was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/).
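
The evaluation figures reported in the README above can be restated in more familiar terms. The sketch below is not part of the commit. It assumes the reported validation loss is the mean cross-entropy over masked tokens, measured in nats, which the README does not state explicitly. Under that assumption, exponentiating the loss gives a masked-token perplexity, and the accuracy gives the complementary error rate.

```python
import math

# Figures copied from the README's "Evaluation Results" table.
eval_loss = 1.952        # reported validation loss
eval_accuracy = 0.6398   # reported validation accuracy

# Assumption: the loss is a mean cross-entropy in nats over masked tokens,
# so exp(loss) can be read as a masked-token perplexity.
perplexity = math.exp(eval_loss)

# Share of masked tokens the model did not predict correctly.
error_rate = 1 - eval_accuracy

print(f"masked-token perplexity ~ {perplexity:.2f}")
print(f"masked-token error rate = {error_rate:.2%}")
```

If the loss was instead logged in bits or averaged differently, the perplexity figure would change, so treat it as a rough reading rather than a reported metric.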