	---
	language: su
	tags:
	- sundanese-roberta-base
	license: mit
	datasets:
	- mc4
	- cc100
	- oscar
	- wikipedia
	widget:
	- text: "Budi nuju <mask> di sakola."
	---

	## Sundanese RoBERTa Base

	Sundanese RoBERTa Base is a masked language model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) model. It was trained on four datasets: [OSCAR](https://hf.co/datasets/oscar)'s `unshuffled_deduplicated_su` subset, the Sundanese [mC4](https://hf.co/datasets/mc4) subset, the Sundanese [CC100](https://hf.co/datasets/cc100) subset, and Sundanese [Wikipedia](https://su.wikipedia.org/).

	10% of the combined dataset was held out for evaluation. The model was trained from scratch and achieved an evaluation loss of 1.952 and an evaluation accuracy of 63.98%.

	This model was trained using Hugging Face's Flax framework. All scripts used for training can be found in the [Files and versions](https://hf.co/w11wo/sundanese-roberta-base/tree/main) tab, along with the [Training metrics](https://hf.co/w11wo/sundanese-roberta-base/tensorboard) logged via TensorBoard.

	## Model

	| Model                    | #params | Arch.   | Training/Validation data (text)       |
	| ------------------------ | ------- | ------- | ------------------------------------- |
	| `sundanese-roberta-base` | 124M    | RoBERTa | OSCAR, mC4, CC100, Wikipedia (758 MB) |

	## Evaluation Results

	The model was trained for 50 epochs. The table below shows the final results at the end of training.

	| train loss | valid loss | valid accuracy | total time |
	| ---------- | ---------- | -------------- | ---------- |
	| 1.965      | 1.952      | 0.6398         | 6:24:51    |
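
To put the loss values in more familiar terms, they can be converted to perplexity. This is a minimal sketch that assumes the reported losses are mean natural-log cross-entropy over the masked tokens, which the model card does not state explicitly; `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Convert a mean natural-log cross-entropy loss into perplexity."""
    return math.exp(cross_entropy_loss)


# Loss values copied from the results table above.
train_ppl = perplexity(1.965)
valid_ppl = perplexity(1.952)

print(f"train perplexity: {train_ppl:.2f}")
print(f"valid perplexity: {valid_ppl:.2f}")
```

Under that assumption, the validation loss of 1.952 corresponds to a perplexity of roughly 7.04 on masked tokens.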

	## How to Use

	### As Masked Language Model

	```python
	from transformers import pipeline

	pretrained_name = "w11wo/sundanese-roberta-base"

	fill_mask = pipeline(
	    "fill-mask",
	    model=pretrained_name,
	    tokenizer=pretrained_name,
	)

	fill_mask("Budi nuju <mask> di sakola.")
	```
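
The `fill_mask` call above returns a list of candidate fills, each a dictionary with `score`, `token`, `token_str`, and `sequence` keys. As a minimal sketch, the hypothetical helper `top_predictions` (not part of `transformers`) reduces that list to the most likely tokens. The sample entries below are made up for illustration and are not real model output.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from a fill-mask result."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    # token_str may carry a leading space from the byte-level BPE tokenizer.
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# Made-up entries in the shape the fill-mask pipeline returns (not real model output).
sample_results = [
    {"score": 0.10, "token": 12, "token_str": " ulin",
     "sequence": "Budi nuju ulin di sakola."},
    {"score": 0.55, "token": 34, "token_str": " diajar",
     "sequence": "Budi nuju diajar di sakola."},
]

print(top_predictions(sample_results, k=1))
```

With the real pipeline, the same helper can be applied directly: `top_predictions(fill_mask("Budi nuju <mask> di sakola."))`.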

	### Feature Extraction in PyTorch

	```python
	from transformers import RobertaModel, RobertaTokenizerFast

	pretrained_name = "w11wo/sundanese-roberta-base"
	model = RobertaModel.from_pretrained(pretrained_name)
	tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

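	prompt = "Budi nuju diajar di sakola."
	# Tokenize the prompt into PyTorch tensors, then run a forward pass.
	encoded_input = tokenizer(prompt, return_tensors="pt")
	# output.last_hidden_state holds one hidden vector per input token.
	output = model(**encoded_input)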
	```
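
The model returns one hidden vector per token. One common way to turn these into a single sentence vector is attention-masked mean pooling. The sketch below is an assumption about how the features might be used, not something the model card prescribes; `mean_pool` and `embed_sentences` are hypothetical helper names.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against division by zero
    return summed / counts


def embed_sentences(sentences, pretrained_name="w11wo/sundanese-roberta-base"):
    """Load the model and return one pooled vector per input sentence."""
    from transformers import RobertaModel, RobertaTokenizerFast

    model = RobertaModel.from_pretrained(pretrained_name)
    tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)
    encoded = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])
```

Calling `embed_sentences(["Budi nuju diajar di sakola."])` downloads the model on first use and returns a tensor with one row per sentence.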

	## Disclaimer

	Consider the biases present in all four datasets, which may carry over into the outputs of this model.

	## Author

	Sundanese RoBERTa Base was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/).