diff --git "a/README.md" "b/README.md"
--- "a/README.md"
+++ "b/README.md"
@@ -1,586 +1,22 @@
- - - - - - - - - - - - - - - - - - - - - - - - README.md · Kamel/DarijaBERT at main - - -
- - -
Hugging Face is way more fun with friends and colleagues! 🤗 - Join an organization -
-
- -

-
Kamel Gaanoun's picture - Kamel -
/
-
DarijaBERT -
-
-
-
-

- -
-
- - -
- - - - -
-
-
- - - - -
-
-
-
-
- -
- - - - -
-
DarijaBERT - / - README.md - -
- -
-
-
Kamel's picture - - Update README.md - 2c1f411 - - -
- - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1
---
2
language: ar
3
widget:
4
 - text: " جاب ليا [MASK] ."
5
 - text: "مشيت نجيب[MASK] فالفرماسيان ."
6
---
7
8
9
AIOX Lab and  SI2M Lab INSEA have joined forces to offer researchers, industrialists and the NLP (Natural Language Processing) community the first intelligent Open Source system that understands Moroccan dialectal language "Darija".
10
11
12
**DarijaBERT** is the first BERT model for the Moroccan Arabic dialect called “Darija”. It is based on the same architecture as BERT-base, but without the Next Sentence Prediction (NSP) objective. This model was trained on a total of ~3 Million sequences of Darija dialect representing 691MB of text or a total of ~100M tokens.
13
14
The model was trained on a dataset issued from three different sources:
15
*  Stories written in Darija scrapped from a dedicated website
16
*  Youtube comments from 40 different Moroccan channels
17
*  Tweets crawled based on a list of Darija keywords. 
18
19
More details about DarijaBert are available in the dedicated GitHub [repository](https://github.com/AIOXLABS/DBert) 
20
21
**Loading the model**
22
23
The model can be loaded directly using the Huggingface library:
24
25
```python
26
from transformers import AutoTokenizer, AutoModel
27
DarijaBERT_tokenizer = AutoTokenizer.from_pretrained("Kamel/DarijaBERT")
28
DarijaBert_model = AutoModel.from_pretrained("Kamel/DarijaBERT")
29
```
30
 
31
**Acknowledgments**
32
33
We gratefully acknowledge Google’s TensorFlow Research Cloud (TRC) program for providing us with free Cloud TPUs.
34
35
36
-
-
- - - - - - - - - - -
+---
+language: ar
+widget:
+ - text: " mchit njib [MASK] ."
+ - text: " twder lia [MASK]."
+---
+AIOX Lab and SI2M Lab INSEA have joined forces to offer researchers, industrialists and the NLP (Natural Language Processing) community the first intelligent Open Source system that understands Moroccan dialectal language "Darija".
+**DarijaBERT** is the first BERT model for the Moroccan Arabic dialect called “Darija”. It is based on the same architecture as BERT-base, but without the Next Sentence Prediction (NSP) objective. This model is the Arabizi specific version of DarijaBERT and it was trained on a total of ~4.6 Million sequences of Darija dialect written in Latin letters.
+
+The model was trained on a dataset issued from Youtube comments.
+
+More details about DarijaBert are available in the dedicated GitHub [repository](https://github.com/AIOXLABS/DBert)
+**Loading the model**
+The model can be loaded directly using the Huggingface library:
+```python
+from transformers import AutoTokenizer, AutoModel
+DarijaBERT_tokenizer = AutoTokenizer.from_pretrained("Kamel/DarijaBERT-arabizi")
+DarijaBert_model = AutoModel.from_pretrained("Kamel/DarijaBERT-arabizi")
+```
+
+**Acknowledgments**
+We gratefully acknowledge Google’s TensorFlow Research Cloud (TRC) program for providing us with free Cloud TPUs.