w11wo/sundanese-gpt2-base

w11wo committed on Jul 17, 2021

Commit fd2fb55 • 1 Parent(s): a932a01

Create README.md

Files changed (1)

README.md +75 -0

README.md ADDED

	@@ -0,0 +1,75 @@

+---
+language: su
+tags:
+  - sundanese-gpt2-base
+license: mit
+datasets:
+  - mc4
+  - cc100
+  - oscar
+  - wikipedia
+widget:
+  - text: "Nami abdi Budi, ti Indonésia"
+---
+## Sundanese GPT-2 Base
+Sundanese GPT-2 Base is a causal language model based on the [OpenAI GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. It was trained on four datasets: [OSCAR](https://hf.co/datasets/oscar)'s `unshuffled_deduplicated_su` subset, the Sundanese [mC4](https://hf.co/datasets/mc4) subset, the Sundanese [CC100](https://hf.co/datasets/cc100) subset, and Sundanese [Wikipedia](https://su.wikipedia.org/).
+10% of the combined dataset was held out for evaluation. The model was trained from scratch and achieved an evaluation loss of 3.61 and an evaluation perplexity of 36.97.
+This model was trained using HuggingFace's Flax framework. All scripts used for training can be found in the [Files and versions](https://hf.co/w11wo/sundanese-gpt2-base/tree/main) tab, along with the [Training metrics](https://hf.co/w11wo/sundanese-gpt2-base/tensorboard) logged via TensorBoard.
+## Model
+| Model                 | #params | Arch. | Training/Validation data (text)       |
+| --------------------- | ------- | ----- | ------------------------------------- |
+| `sundanese-gpt2-base` | 124M    | GPT-2 | OSCAR, mC4, CC100, Wikipedia (758 MB) |
+## Evaluation Results
+The model was trained for 50 epochs. The table below shows the final results at the end of training.
+| train loss | valid loss | valid PPL | total time |
+| ---------- | ---------- | --------- | ---------- |
+| 2.436      | 3.61       | 36.97     | 7:1:54     |
+## How to Use
+### As Causal Language Model
+```python
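+from transformers import pipeline
+pretrained_name = "w11wo/sundanese-gpt2-base"
+# Build a text-generation pipeline from the model and tokenizer on the Hub.
+nlp = pipeline(
+    "text-generation",
+    model=pretrained_name,
+    tokenizer=pretrained_name
+)
+# Returns a list of dicts, each with a "generated_text" key.
+nlp("Nami abdi Budi, ti Indonésia")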
+```
+### Feature Extraction in PyTorch
+```python
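+from transformers import GPT2Model, GPT2TokenizerFast
+pretrained_name = "w11wo/sundanese-gpt2-base"
+# Load the base GPT-2 model (no language-modeling head) and its tokenizer.
+model = GPT2Model.from_pretrained(pretrained_name)
+tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_name)
+prompt = "Nami abdi Budi, ti Indonésia"
+# Tokenize the prompt into PyTorch tensors.
+encoded_input = tokenizer(prompt, return_tensors='pt')
+# output.last_hidden_state holds one feature vector per input token.
+output = model(**encoded_input)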
+```
+## Disclaimer
+Consider that biases present in all four datasets may carry over into this model's outputs.
+## Author
+Sundanese GPT-2 Base was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/).
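
A note on the numbers in the README's Evaluation Results table: for a causal language model, perplexity is conventionally the exponential of the mean cross-entropy loss (in nats). The sketch below assumes that convention applies here, which the README does not state explicitly. `perplexity_from_loss` is a hypothetical helper name used for illustration, not something taken from the training scripts. Under that assumption, the reported validation loss of 3.61 agrees with the reported perplexity of 36.97 up to rounding.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    # Hypothetical helper for illustration; assumes PPL = exp(loss).
    return math.exp(mean_cross_entropy)


# Values copied from the Evaluation Results table in the README.
reported_valid_loss = 3.61
reported_valid_ppl = 36.97

recomputed_ppl = perplexity_from_loss(reported_valid_loss)

# The reported loss is rounded to two decimals, so allow a small tolerance.
assert abs(recomputed_ppl - reported_valid_ppl) < 0.05
print(f"recomputed perplexity: {recomputed_ppl:.2f}")
```

The same helper can be applied to the reported train loss of 2.436 to get a rough training perplexity for comparison with the validation figure.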