# 5CD-AI/visocial-T5-base

## Overview
We trimmed the vocabulary to 50,589 tokens and continually pretrained google/mt5-base [1] on a merged 20GB dataset. The training data includes:

- Crawled data (100M comments and 15M posts on Facebook)
- UIT data [2], which was used to pretrain uitnlp/visobert
- MC4 ecommerce
- 10.7M comments on the VOZ Forum, from tarudesu/VOZ-HSD [7]
- 3.6M reviews from Amazon [3], translated into Vietnamese, from 5CD-AI/Vietnamese-amazon_polarity-gg-translated
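
The checkpoint can be loaded with the `transformers` library. A minimal sketch, assuming standard `AutoModel` loading (the example sentence and generation settings are illustrative, not from the original card):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the continually-pretrained checkpoint and its trimmed tokenizer.
model_id = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Illustrative only: this is a raw pretrained T5 LM intended for fine-tuning,
# so generate() output here is not task-specific.
inputs = tokenizer("xin chào mọi người", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```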
Here are the results on 3 downstream tasks on Vietnamese social media text: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):
| Model | Average MF1 | HSD Acc | HSD WF1 | HSD MF1 | TSD Acc | TSD WF1 | TSD MF1 | HSpD Acc | HSpD WF1 | HSpD MF1 |
|---|---|---|---|---|---|---|---|---|---|---|
| PhoBERT [4] | 69.63 | 86.75 | 86.52 | 64.76 | 90.78 | 90.27 | 71.31 | 84.65 | 81.12 | 72.81 |
| PhoBERT_v2 [4] | 70.50 | 87.42 | 87.33 | 66.60 | 90.23 | 89.78 | 71.39 | 84.92 | 81.51 | 73.51 |
| viBERT [5] | 67.80 | 86.33 | 85.79 | 62.85 | 88.81 | 88.17 | 67.65 | 84.63 | 81.28 | 72.91 |
| ViSoBERT [6] | 75.07 | 88.17 | 87.86 | 67.71 | 90.35 | 90.16 | 71.45 | 90.16 | 90.07 | 86.04 |
| ViHateT5 [7] | 75.56 | 88.76 | 89.14 | 68.67 | 90.80 | 91.78 | 71.63 | 91.00 | 90.20 | 86.37 |
| visocial-T5-base (ours) | 78.01 | 89.51 | 89.78 | 71.19 | 92.20 | 93.47 | 73.81 | 92.57 | 92.20 | 89.04 |

Acc = accuracy, WF1 = weighted F1, MF1 = macro F1; HSD, TSD, and HSpD abbreviate the three tasks above.
visocial-T5-base versus other T5-based models on Vietnamese HSD-related tasks, in Macro F1-score (MF1):
| Model | Hate Speech Detection MF1 | Toxic Speech Detection MF1 | Hate Spans Detection MF1 |
|---|---|---|---|
| mT5 [1] | 66.76 | 69.93 | 86.60 |
| ViT5 [8] | 66.95 | 64.82 | 86.90 |
| ViHateT5 [7] | 68.67 | 71.63 | 86.37 |
| visocial-T5-base (ours) | 71.90 | 73.81 | 89.04 |
## Fine-tune Configuration

We fine-tune 5CD-AI/visocial-T5-base on the 3 downstream tasks with the `transformers` library, using the following configuration (a minimal sketch follows the list):

- seed: 42
- training_epochs: 4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 3e-4
- lr_scheduler_type: linear
- model_max_length: 256
- metric_for_best_model: eval_loss
- evaluation_strategy: steps
- eval_steps: 0.1
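
A sketch of this setup with `Seq2SeqTrainingArguments`; the output directory is a hypothetical placeholder, and dataset preparation is task-specific and omitted here:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    set_seed,
)

set_seed(42)  # seed: 42

model_id = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=256)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

training_args = Seq2SeqTrainingArguments(
    output_dir="visocial-t5-base-finetuned",  # hypothetical output path
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    metric_for_best_model="eval_loss",
    evaluation_strategy="steps",
    eval_steps=0.1,  # float < 1: evaluate every 10% of total training steps
    seed=42,
)

# Pass training_args to a Seq2SeqTrainer together with the tokenized
# train/eval datasets of the chosen downstream task, then call .train().
```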
## References

[1] mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

[2] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[3] The Amazon Polarity dataset

[4] PhoBERT: Pre-trained Language Models for Vietnamese

[5] Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models

[6] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[7] ViHateT5: Enhancing Hate Speech Detection in Vietnamese with a Unified Text-to-Text Transformer Model

[8] ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation