language: id
tags:
  - indonesian-roberta-base-prdect-id
license: apache-2.0
datasets:
  - prdect-id
widget:
  - text: Wah, kualitas produk ini sangat bagus!

Indonesian RoBERTa Base PRDECT-ID

Indonesian RoBERTa Base PRDECT-ID is an emotion text-classification model based on RoBERTa. It starts from the pre-trained Indonesian RoBERTa Base model and is fine-tuned on PRDECT-ID, a dataset of Indonesian product reviews annotated with five emotions (Sutoyo et al., 2022).

This model was trained using Hugging Face's Transformers library on PyTorch. All training was done on an NVIDIA T4 GPU provided by Google Colaboratory. Training metrics were logged via TensorBoard.

Model

| Model | #params | Arch. | Training/Validation data (text) |
| --- | --- | --- | --- |
| indonesian-roberta-base-prdect-id | 124M | RoBERTa Base | PRDECT-ID |

Evaluation Results

The model achieves the following results on evaluation:

| Dataset | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| PRDECT-ID | 0.685185 | 0.644750 | 0.646400 | 0.643710 |
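
The card does not say how F1, precision, and recall are averaged across the emotion classes. As a minimal sketch, assuming macro averaging (an assumption, not something confirmed by the card), the four metrics could be computed from gold and predicted labels as follows. The function name `macro_metrics` and the toy labels are illustrative only.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption here; the model card does not
    state which averaging mode was used for the reported numbers.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration; these are not PRDECT-ID data.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_metrics = macro_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Under macro averaging, F1 is the mean of the per-class F1 scores, so it need not equal the harmonic mean of the averaged precision and recall.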

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 5
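
To make the `linear` scheduler setting concrete, here is a minimal sketch of a linearly decaying learning rate in plain Python. It takes the base rate from the list above and the 152 steps per epoch from the training results table. The card does not mention warmup, so zero warmup steps is an assumption, and `linear_lr` is an illustrative helper rather than a library function.

```python
BASE_LR = 2e-5          # learning_rate from the hyperparameter list
STEPS_PER_EPOCH = 152   # taken from the training results table
NUM_EPOCHS = 5          # num_epochs from the hyperparameter list
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS
WARMUP_STEPS = 0        # assumption: the card does not state any warmup


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS,
              warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step under linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp-up during warmup (unused when warmup_steps == 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the end of each epoch.
for epoch in range(1, NUM_EPOCHS + 1):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch}: step {step}, lr {linear_lr(step):.2e}")
```

Under these assumptions the rate falls from 2e-05 at step 0 to zero at step 760, the final step in the training results table.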

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1.0358 | 1.0 | 152 | 0.8293 | 0.6519 | 0.5814 | 0.6399 | 0.5746 |
| 0.7012 | 2.0 | 304 | 0.7444 | 0.6741 | 0.6269 | 0.6360 | 0.6220 |
| 0.5599 | 3.0 | 456 | 0.7635 | 0.6852 | 0.6440 | 0.6433 | 0.6453 |
| 0.4628 | 4.0 | 608 | 0.8031 | 0.6852 | 0.6421 | 0.6471 | 0.6396 |
| 0.4027 | 5.0 | 760 | 0.8133 | 0.6852 | 0.6447 | 0.6464 | 0.6437 |

How to Use

As Text Classifier

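from transformers import pipeline

# Hub ID of the fine-tuned checkpoint
pretrained_name = "w11wo/indonesian-roberta-base-prdect-id"

# "sentiment-analysis" is the Transformers alias for the text-classification pipeline
nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name,
)

# Returns a list with one {"label": ..., "score": ...} dict per input text
print(nlp("Wah, kualitas produk ini sangat bagus!"))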
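
The pipeline returns a list of dictionaries with `label` and `score` keys. A minimal sketch of post-processing such output is shown below. The helper `best_label`, the 0.5 threshold, and the label names `EMOTION_A` and `EMOTION_B` are hypothetical placeholders; the model's actual label names are not listed in this card.

```python
def best_label(predictions, min_score=0.5):
    """Return the highest-scoring label, or None if it is below min_score.

    `predictions` is a list of {"label": str, "score": float} dicts,
    the shape produced by a text-classification pipeline.
    """
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] if top["score"] >= min_score else None


# Hypothetical pipeline output; labels and scores are made up for illustration.
example_output = [
    {"label": "EMOTION_A", "score": 0.91},
    {"label": "EMOTION_B", "score": 0.09},
]
uncertain_output = [
    {"label": "EMOTION_A", "score": 0.40},
    {"label": "EMOTION_B", "score": 0.35},
]

print(best_label(example_output))
print(best_label(uncertain_output))
```

Returning `None` for low-confidence predictions is one possible design choice; given the evaluation accuracy of about 0.69, callers may want to treat low-scoring outputs with caution.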

Disclaimer

Consider the biases of both the pre-trained RoBERTa model and the PRDECT-ID dataset, which may carry over into this model's predictions.

Author

Indonesian RoBERTa Base PRDECT-ID was trained and evaluated by Wilson Wongso. All computation and development were done on Google Colaboratory using its free GPU access.

Framework versions

  • Transformers 4.24.0
  • PyTorch 1.12.1+cu113
  • Datasets 2.7.1
  • Tokenizers 0.13.2

References

@article{SUTOYO2022108554,
    title = {PRDECT-ID: Indonesian product reviews dataset for emotions classification tasks},
    journal = {Data in Brief},
    volume = {44},
    pages = {108554},
    year = {2022},
    issn = {2352-3409},
    doi = {https://doi.org/10.1016/j.dib.2022.108554},
    url = {https://www.sciencedirect.com/science/article/pii/S2352340922007612},
    author = {Rhio Sutoyo and Said Achmad and Andry Chowanda and Esther Widhi Andangsari and Sani M. Isa},
    keywords = {Natural language processing, Text processing, Text mining, Emotions classification, Sentiment analysis},
    abstract = {Recognizing emotions is vital in communication. Emotions convey additional meanings to the communication process. Nowadays, people can communicate their emotions on many platforms; one is the product review. Product reviews in the online platform are an important element that affects customers’ buying decisions. Hence, it is essential to recognize emotions from the product reviews. Emotions recognition from the product reviews can be done automatically using a machine or deep learning algorithm. Dataset can be considered as the fuel to model the recognizer. However, only a limited dataset exists in recognizing emotions from the product reviews, particularly in a local language. This research contributes to the dataset collection of 5400 product reviews in Indonesian. It was carefully curated from various (29) product categories, annotated with five emotions, and verified by an expert in clinical psychology. The dataset supports an innovative process to build automatic emotion classification on product reviews.}
}