File size: 2,798 Bytes
e14a9e0 1f87701 330dea7 1f87701 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 |
---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- Dialectal Arabic
- Arabic
- sequence labeling
- Named entity recognition
- Part-of-speech tagging
- Zero-shot transfer learning
- bert
license: "mit"
---
# Dialectal Arabic XLM-R Base
This is a repo of the language model used for "AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling". The state-of-the-art method for sequence labeling on multi-dialect Arabic.
### About the Dialectal-Arabic-XLM-R-Base model
This model is an trained as a further pre-trained of XLM-RoBERTa base using the Masked-language modeling on a dialectal Arabic corpus.
### About the Dialectal-Arabic-XLM-R-Base model training corpora
We have built a 5 million Tweets corpus from Twitter. The crawled tweets cover the dialects of the four Arabic world regions (EGY, GLF, LEV, and MAG regions), as well as MSA. The collected corpus consists of one million (1M) tweets per Arabic variant. We did not perform any text pre-processing on the tweets, except by removing tweets that have a small length (tweets containing less than four words).
### Usage
The model weights can be loaded using `transformers` library by HuggingFace.
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("3ebdola/Dialectal-Arabic-XLM-R-Base")
model = AutoModel.from_pretrained("3ebdola/Dialectal-Arabic-XLM-R-Base")
text = "هذا مثال لنص باللغة العربية, يمكنك استعمال اللهجات العربية أيضا"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
### Citation
```
@article{ELMEKKI2022102964,
title = {AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling},
journal = {Information Processing & Management},
volume = {59},
number = {4},
pages = {102964},
year = {2022},
issn = {0306-4573},
doi = {https://doi.org/10.1016/j.ipm.2022.102964},
url = {https://www.sciencedirect.com/science/article/pii/S0306457322000814},
author = {Abdellah {El Mekki} and Abdelkader {El Mahdaouy} and Ismail Berrada and Ahmed Khoumsi},
keywords = {Dialectal Arabic, Arabic natural language processing, Domain adaptation, Multi-dialectal sequence labeling, Named entity recognition, Part-of-speech tagging, Zero-shot transfer learning}
}
```
|