ArabicT5-17GB-large / README.md

ArabicT5: Efficient Adaptation of T5 on Arabic Language

Model Description

This model adapts T5 to the Arabic language by pre-training T5 on Arabic Wikipedia, Marefa, Hindawi Books, and a collection of Arabic news. The total corpus size is 17GB. We restrict our corpora to news and encyclopedias to enhance the model's performance on informative tasks such as factoid question answering and generative tasks that use classic Arabic ( الفصحى ). This model uses an efficient implementation of T5 that reduces fine-tuning cost and memory usage (Link).

Pre-training Settings and Results on TyDi QA Development Dataset (Model in this card is highlighted in bold)

| Model | Hidden Layer Size | Attention Heads | Attention Layers | Vocab | Hardware | Training Steps | Batch Size | Train x Batch Factor | Corpora |
|---|---|---|---|---|---|---|---|---|---|
| AraT5-Base | 768 | 12 | 12 | 110K | TPUv3-8 | 1M | 128 | 1.0x | 248GB 29B tokens (MSA + Tweets) |
| AraT5-Base-MSA | 768 | 12 | 12 | 110K | TPUv3-8 | 1M | 128 | 1.0x | 70GB (MSA) |
| AraT5-Base-Tweets | 768 | 12 | 12 | 110K | TPUv3-8 | 1M | 128 | 1.0x | 178GB (Tweets) |
| mT5-Base | 768 | 12 | 12 | 250K | TPUv3-32 | 1M | 1024 | 8.0x | 6.3T tokens (mC4) |
| ArabicT5-Base | 512 | 8 | 20 | 32K | TPUv3-32 | 256K | 256 | 0.5x | 17GB (MSA) |
| **ArabicT5-Large** | 768 | 12 | 16 | 32K | TPUv3-128 | 500K | 512 | 2.0x | 17GB (MSA) |
| ArabicT5-xLarge | 768 | 12 | 36 | 32K | TPUv3-128 | 500K | 512 | 2.0x | 17GB (MSA) |
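The card does not define the "Train x Batch Factor" column. One reading that reproduces every value in the table is the product of training steps and batch size, taken relative to AraT5-Base (1M steps at batch 128). The sketch below checks that reading; the function name `train_batch_factor`, the choice of baseline, and the interpretation of "K" as 1,000 are assumptions, not something the card states.

```python
# Assumed reading: factor = (steps * batch) / (baseline steps * baseline batch),
# with AraT5-Base (1M steps, batch 128) as the baseline.
# Steps and batch sizes are copied from the pre-training settings table.
SETTINGS = {
    "AraT5-Base": (1_000_000, 128),
    "mT5-Base": (1_000_000, 1024),
    "ArabicT5-Base": (256_000, 256),   # "256K" read as 256,000 (assumption)
    "ArabicT5-Large": (500_000, 512),
    "ArabicT5-xLarge": (500_000, 512),
}


def train_batch_factor(steps, batch, base_steps=1_000_000, base_batch=128):
    """Number of training examples seen, relative to the baseline run."""
    return (steps * batch) / (base_steps * base_batch)


# Round to one decimal place, as the table does.
FACTORS = {
    name: round(train_batch_factor(steps, batch), 1)
    for name, (steps, batch) in SETTINGS.items()
}

for name, factor in FACTORS.items():
    print(f"{name}: {factor:.1f}x")
```

Under this reading, ArabicT5-Large sees roughly twice as many training examples as AraT5-Base, while ArabicT5-Base sees about half as many.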

Results on TyDi QA, HARD, Sentiment Analysis, Sarcasm Detection (Best Score is highlighted in bold)

| Model | TyDi QA (Dev) | HARD (Hotel Review) | ArSarcasm-v2 (Sentiment Analysis) | ArSarcasm-v2 (Sarcasm Detection) |
|---|---|---|---|---|
| AraT5-Base | 70.36/84.21 | 96.49 | 69.7/72.63 | 60.44 |
| AraT5-Base-MSA | 70.90/84.00 | **96.52** | 70.03/72.73 | 60.69 |
| AraT5-Base-Tweets | 65.14/79.00 | 96.26 | 70.67/73.52 | 61.11 |
| mT5-Base | 72.20/84.13 | 96.24 | 67.33/68.78 | 52.18 |
| ArabicT5-Base | 70.79/84.76 | 96.36 | 68.93/71.20 | 58.93 |
| ArabicT5-Large | 73.29/86.08 | 96.40 | 70.4/73.01 | 59.79 |
| ArabicT5-xLarge | **75.46/87.12** | 96.50 | **72.23/75.17** | **61.66** |

Evaluation metrics: TyDi QA (EM/F1), HARD (Accuracy), Sentiment Analysis (Accuracy / F1-PN, the F1 over the positive and negative classes), Sarcasm Detection (F1-sarcastic)
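As a rough illustration of the metrics named above, here is a minimal sketch of exact match (EM), token-level F1, and F1-PN. It assumes SQuAD-style scoring over whitespace tokens for EM/F1 and treats F1-PN as the unweighted mean of the per-class F1 for the positive and negative labels. The label strings `POS` and `NEG` and all function names are illustrative choices. It is not the official TyDi QA or ArSarcasm-v2 scoring script, which may normalize text differently.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the whitespace-tokenized strings are identical, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_pn(gold, pred, positive="POS", negative="NEG"):
    """Mean of the F1 scores for the positive and negative classes only."""

    def class_f1(label):
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return (class_f1(positive) + class_f1(negative)) / 2


# A prediction that contains the one-token answer plus two extra tokens.
print(exact_match("القاهرة عاصمة مصر", "القاهرة"))
print(token_f1("القاهرة عاصمة مصر", "القاهرة"))
```

The scores in the table are on a 0 to 100 scale, so values from these functions would be averaged over the evaluation set and multiplied by 100 for comparison.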

Paper

Generative Approach for Gender-Rewriting Task with ArabicT5

Fine-tuning our ArabicT5 model on generative and abstractive tasks with FLAX

Open In Colab

GitHub Page

https://github.com/salrowili/ArabicT5

Acknowledgment

We would like to acknowledge the support of the TPU Research Cloud (TRC) team, which granted us access to TPUv3 units.

Citation

@inproceedings{alrowili-shanker-2022-generative,
    title = "Generative Approach for Gender-Rewriting Task with {A}rabic{T}5",
    author = "Alrowili, Sultan  and
      Shanker, Vijay",
    booktitle = "Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wanlp-1.55",
    pages = "491--495",
    abstract = "Addressing the correct gender in generative tasks (e.g., Machine Translation) has been an overlooked issue in the Arabic NLP. However, the recent introduction of the Arabic Parallel Gender Corpus (APGC) dataset has established new baselines for the Arabic Gender Rewriting task. To address the Gender Rewriting task, we first pre-train our new Seq2Seq ArabicT5 model on a 17GB of Arabic Corpora. Then, we continue pre-training our ArabicT5 model on the APGC dataset using a newly proposed method. Our evaluation shows that our ArabicT5 model, when trained on the APGC dataset, achieved competitive results against existing state-of-the-art methods. In addition, our ArabicT5 model shows better results on the APGC dataset compared to other Arabic and multilingual T5 models.",
}