# 5CD-AI/visocial-T5-base

## Overview
We trimmed the vocabulary to 50,589 tokens and continually pretrained google/mt5-base [1] on a merged 20GB dataset. The training data includes:

- Crawled data (100M comments and 15M posts on Facebook)
- UIT data [2], which was used to pretrain uitnlp/visobert
- MC4 ecommerce
- 10.7M comments on the VOZ Forum, from tarudesu/VOZ-HSD [7]
- 3.6M reviews from Amazon [3], translated into Vietnamese, from 5CD-AI/Vietnamese-amazon_polarity-gg-translated
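
The checkpoint can be loaded with the `transformers` library. A minimal sketch, assuming standard `AutoModel` loading (the example sentence and generation settings are illustrative, not from the original card):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the continually-pretrained checkpoint and its trimmed tokenizer.
model_id = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Illustrative only: this is a raw pretrained T5 LM intended for fine-tuning,
# so generate() output here is not task-specific.
inputs = tokenizer("xin chào mọi người", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```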
Here are the results on 3 downstream tasks on Vietnamese social media text: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):
| Model | Average MF1 | HSD Acc | HSD WF1 | HSD MF1 | TSD Acc | TSD WF1 | TSD MF1 | HSpD Acc | HSpD WF1 | HSpD MF1 |
|---|---|---|---|---|---|---|---|---|---|---|
| PhoBERT [4] | 69.63 | 86.75 | 86.52 | 64.76 | 90.78 | 90.27 | 71.31 | 84.65 | 81.12 | 72.81 |
| PhoBERT_v2 [4] | 70.50 | 87.42 | 87.33 | 66.60 | 90.23 | 89.78 | 71.39 | 84.92 | 81.51 | 73.51 |
| viBERT [5] | 67.80 | 86.33 | 85.79 | 62.85 | 88.81 | 88.17 | 67.65 | 84.63 | 81.28 | 72.91 |
| ViSoBERT [6] | 75.07 | 88.17 | 87.86 | 67.71 | 90.35 | 90.16 | 71.45 | 90.16 | 90.07 | 86.04 |
| ViHateT5 [7] | 75.56 | 88.76 | 89.14 | 68.67 | 90.80 | 91.78 | 71.63 | 91.00 | 90.20 | 86.37 |
| visocial-T5-base (ours) | 78.01 | 89.51 | 89.78 | 71.19 | 92.20 | 93.47 | 73.81 | 92.57 | 92.20 | 89.04 |

Acc = accuracy, WF1 = weighted F1, MF1 = macro F1; HSD, TSD, and HSpD abbreviate the three tasks above.
visocial-T5-base versus other T5-based models on Vietnamese HSD-related tasks, in Macro F1-score (MF1):
| Model | Hate Speech Detection MF1 | Toxic Speech Detection MF1 | Hate Spans Detection MF1 |
|---|---|---|---|
| mT5 [1] | 66.76 | 69.93 | 86.60 |
| ViT5 [8] | 66.95 | 64.82 | 86.90 |
| ViHateT5 [7] | 68.67 | 71.63 | 86.37 |
| visocial-T5-base (ours) | 71.90 | 73.81 | 89.04 |
## Fine-tune Configuration

We fine-tune 5CD-AI/visocial-T5-base on the 3 downstream tasks with the `transformers` library, using the following configuration (a minimal sketch follows the list):

- seed: 42
- training_epochs: 4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 3e-4
- lr_scheduler_type: linear
- model_max_length: 256
- metric_for_best_model: eval_loss
- evaluation_strategy: steps
- eval_steps: 0.1
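
A sketch of this setup with `Seq2SeqTrainingArguments`; the output directory is a hypothetical placeholder, and dataset preparation is task-specific and omitted here:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    set_seed,
)

set_seed(42)  # seed: 42

model_id = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=256)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

training_args = Seq2SeqTrainingArguments(
    output_dir="visocial-t5-base-finetuned",  # hypothetical output path
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    metric_for_best_model="eval_loss",
    evaluation_strategy="steps",
    eval_steps=0.1,  # float < 1: evaluate every 10% of total training steps
    seed=42,
)

# Pass training_args to a Seq2SeqTrainer together with the tokenized
# train/eval datasets of the chosen downstream task, then call .train().
```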
## References

[1] mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

[2] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[3] The Amazon Polarity dataset

[4] PhoBERT: Pre-trained Language Models for Vietnamese

[5] Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models

[6] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[7] ViHateT5: Enhancing Hate Speech Detection in Vietnamese with a Unified Text-to-Text Transformer Model

[8] ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation