Dialectical-MSA-detection
Model description
This model was trained on 108,173 manually annotated user-generated content (UGC) texts, such as tweets and online comments, to classify Arabic text into one of two categories: 'Dialectical' or 'MSA' (Modern Standard Arabic).
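A minimal sketch of how the classifier might be called through the `transformers` text-classification pipeline. The repository id below is a hypothetical placeholder, since the model's namespace on the Hub is not given here, and the label strings returned depend on the model's configuration (the two categories above are assumed).

```python
from typing import Callable, Dict, List

# Hypothetical placeholder: replace with the actual Hub repository id of this model.
MODEL_ID = "<namespace>/Dialectical-MSA-detection"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily so the helper below can be used without `transformers`.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def label_texts(classifier: Callable[[List[str]], List[Dict]], texts: List[str]) -> List[str]:
    """Return the predicted label string for each input text."""
    return [prediction["label"] for prediction in classifier(texts)]
```

With a real repository id, `label_texts(load_classifier(), ["..."])` would return one label per input sentence.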
Training data
Dialectical-MSA-detection was trained on the Arabic Online Commentary (AOC) dataset (Zaidan et al., 2011). The AOC dataset was created by crawling the websites of three Arabic newspapers and extracting online articles and readers' comments.
Training procedure
xlm-roberta-base was fine-tuned using the Hugging Face Trainer with the following hyperparameters:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path (assumption; not in the original config)
    num_train_epochs=4,              # total number of training epochs
    learning_rate=2e-5,              # initial learning rate
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=4,    # batch size per device during evaluation
    warmup_steps=0,                  # number of warmup steps for the learning rate scheduler
    weight_decay=0.02,               # strength of weight decay
)
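To make the effect of these settings concrete, the sketch below reproduces the shape of a linear learning rate schedule with no warmup. It assumes the Trainer's default linear scheduler (no scheduler type is set above), a single device, and no gradient accumulation. The training set size of 97,355 is a hypothetical figure, roughly 90% of 108,173.

```python
import math


def total_training_steps(n_train: int, batch_size: int = 32, epochs: int = 4) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup (if any), then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical train size: about 90% of the 108,173 annotated texts.
steps = total_training_steps(97_355)
halfway_lr = linear_lr(steps // 2, steps)
```

With `warmup_steps=0`, the rate starts at its full value of 2e-5 on the first step and decays linearly to zero by the last step.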
Eval results
The model was evaluated on a held-out 10% of the sentences (90-10 train-dev split) and reached an accuracy of 0.88 on the dev set.
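The evaluation setup can be sketched in plain Python as follows. The shuffling and the seed are assumptions, since how the split was drawn is not stated, and the function names are illustrative only.

```python
import random
from typing import List, Sequence, Tuple


def train_dev_split(items: Sequence, dev_fraction: float = 0.1, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the items and hold out `dev_fraction` of them as a dev set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (assumption) for reproducibility
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

For instance, `accuracy(["MSA", "Dialectical"], ["MSA", "MSA"])` gives 0.5, since one of the two predictions matches.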
Limitations and bias
The model was trained on sentences from the online commentary domain. Other forms of user-generated text (UGT), such as tweets, may differ in their degree of dialectness.
BibTeX entry and citation info
@article{saadany2022semi,
title={A Semi-supervised Approach for a Better Translation of Sentiment in Dialectical Arabic UGT},
author={Saadany, Hadeel and Orasan, Constantin and Mohamed, Emad and Tantawy, Ashraf},
journal={arXiv preprint arXiv:2210.11899},
year={2022}
}