---
language:
- ar
- ary
widget:
 - text: " جاب ليا [MASK] ."
 - text: "مشيت نجيب[MASK] فالفرماسيان ."
---


AIOX Lab and the SI2M Lab at INSEA have joined forces to offer researchers, industry practitioners, and the NLP (Natural Language Processing) community the first open-source system that understands the Moroccan dialect, "Darija".


**DarijaBERT** is the first BERT model for the Moroccan Arabic dialect known as “Darija”. It uses the same architecture as BERT-base but is trained without the Next Sentence Prediction (NSP) objective. The training corpus contains ~3 million Darija sequences, amounting to 691 MB of text or ~100M tokens.
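Dropping NSP leaves masked language modeling (MLM) as the only pre-training objective. The sketch below illustrates the standard BERT masking rule (select about 15% of tokens; replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged). It is a minimal illustration only: the helper `mask_tokens`, its parameters, and the toy vocabulary are hypothetical, and the exact masking settings used to train DarijaBERT are not stated here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM masking to a list of tokens (illustrative only).

    Returns (masked_tokens, labels). labels[i] holds the original token at
    every position selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Toy example; a real pipeline would operate on WordPiece ids, not raw words.
tokens = ["مشيت", "نجيب", "الدوا", "فالفرماسيان", "."]
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
```

In practice this step is usually delegated to a data collator in the training framework rather than written by hand.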

The model was trained on a dataset drawn from three different sources:
*  Stories written in Darija, scraped from a dedicated website
*  YouTube comments from 40 different Moroccan channels
*  Tweets crawled using a list of Darija keywords

More details about DarijaBERT are available in the dedicated GitHub [repository](https://github.com/AIOXLABS/DBert).

**Loading the model**

The model can be loaded directly with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModel

# Download (or load from the local cache) the tokenizer and the base encoder
DarijaBERT_tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
DarijaBERT_model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
```
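Because the model is trained with a masked-language-modeling objective, one quick way to try it is the `fill-mask` pipeline. The following is a minimal sketch, assuming the checkpoint ships a masked-LM head and uses `[MASK]` as its mask token (as the widget examples in this card suggest); the helper names `top_predictions` and `predict_masked` are illustrative and not part of any library.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


def predict_masked(text, k=3, model_name="SI2M-Lab/DarijaBERT"):
    """Run the fill-mask pipeline on text containing exactly one [MASK]."""
    # Imported lazily so the formatting helper above works without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    # For a single mask, the pipeline returns a list of dicts with
    # "score", "token", "token_str" and "sequence" keys.
    return top_predictions(fill_mask(text), k=k)


# Example call (downloads the model on first use):
# predict_masked(" جاب ليا [MASK] .")
```

Calling `predict_masked` requires network access the first time, since the weights are fetched from the Hugging Face Hub.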

**Citation**

If you use our models in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
```
@article{gaanoun2023darijabert,
  title={{DarijaBERT}: a Step Forward in {NLP} for the Written Moroccan Dialect},
  author={Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
  year={2023}
}
```

 
**Acknowledgments**

We gratefully acknowledge Google’s TensorFlow Research Cloud (TRC) program for providing us with free Cloud TPUs.