---
language: su
tags:
  - sundanese-roberta-base
license: mit
datasets:
  - mc4
  - cc100
  - oscar
  - wikipedia
widget:
  - text: "Budi nuju <mask> di sakola."
---

## Sundanese RoBERTa Base

Sundanese RoBERTa Base is a masked language model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) model. It was trained on four datasets: [OSCAR](https://hf.co/datasets/oscar)'s `unshuffled_deduplicated_su` subset, the Sundanese [mC4](https://hf.co/datasets/mc4) subset, the Sundanese [CC100](https://hf.co/datasets/cc100) subset, and Sundanese [Wikipedia](https://su.wikipedia.org/).

10% of the dataset was held out for evaluation. The model was trained from scratch and reached an evaluation loss of 1.952 and an evaluation accuracy of 63.98%.
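
As a rough sanity check, the evaluation loss can be converted into a pseudo-perplexity, assuming the reported value is the mean cross-entropy over masked tokens in nats (this assumption is not stated in the training logs):

```python
import math

# Back-of-the-envelope conversion: if the evaluation loss is the mean
# masked-token cross-entropy in nats, its exponential gives a pseudo-perplexity.
eval_loss = 1.952
pseudo_perplexity = math.exp(eval_loss)
print(f"pseudo-perplexity ≈ {pseudo_perplexity:.2f}")  # ≈ 7.04
```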

This model was trained using HuggingFace's Flax framework. All of the training scripts can be found in the [Files and versions](https://hf.co/w11wo/sundanese-roberta-base/tree/main) tab, and the [Training metrics](https://hf.co/w11wo/sundanese-roberta-base/tensorboard) are logged via TensorBoard.

## Model

| Model                    | #params | Arch.   | Training/Validation data (text)       |
| ------------------------ | ------- | ------- | ------------------------------------- |
| `sundanese-roberta-base` | 124M    | RoBERTa | OSCAR, mC4, CC100, Wikipedia (758 MB) |

## Evaluation Results

The model was trained for 50 epochs; the results below are from the end of training.

| train loss | valid loss | valid accuracy | total time |
| ---------- | ---------- | -------------- | ---------- |
| 1.965      | 1.952      | 0.6398         | 6:24:51    |

## How to Use

### As Masked Language Model

```python
from transformers import pipeline

pretrained_name = "w11wo/sundanese-roberta-base"

fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

fill_mask("Budi nuju <mask> di sakola.")
```
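
A minimal sketch of how the pipeline output can be inspected: the pipeline returns a list of candidate fillings, each with the completed sequence, the predicted token, and its probability score (the printed values below depend on the model and are not reproduced from the original card):

```python
# Print the predicted tokens and their probabilities for the masked position.
for prediction in fill_mask("Budi nuju <mask> di sakola."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.4f}")
```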

### Feature Extraction in PyTorch

```python
from transformers import RobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/sundanese-roberta-base"
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

prompt = "Budi nuju diajar di sakola."
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
```
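
The model returns per-token hidden states. To obtain a single fixed-size sentence vector, one common approach (a sketch, not part of the original training setup) is to mean-pool the last hidden states over non-padding tokens:

```python
import torch

# Mean-pool the last hidden states over non-padding tokens to get a
# sentence embedding of shape [batch_size, hidden_size].
hidden_states = output.last_hidden_state               # [batch, seq_len, hidden]
mask = encoded_input["attention_mask"].unsqueeze(-1)   # [batch, seq_len, 1]
sentence_embedding = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the base configuration
```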

## Disclaimer

Do consider the biases present in all four training datasets, which may carry over into this model's outputs.

## Author

Sundanese RoBERTa Base was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/).