Commit 590a87e by wissamantoun
1 Parent(s): c655777
Update README.md
README.md CHANGED
@@ -46,9 +46,9 @@ All models are available in the `HuggingFace` model page under the [aubmindlab](

 We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuations and numbers that were still attached to words when learned the wordpiece vocab. We now insert a space between numbers and characters and around punctuation characters.

-The new vocabulary was
+The new vocabulary was learned using the `BertWordpieceTokenizer` from the `tokenizers` library, and should now support the Fast tokenizer implementation from the `transformers` library.

-**P.S.**: All the old BERT codes should work with the new BERT, just change the model name and check the new preprocessing
+**P.S.**: All the old BERT codes should work with the new BERT, just change the model name and check the new preprocessing function
 **Please read the section on how to use the [preprocessing function](#Preprocessing)**

 ## Bigger Dataset and More Compute
@@ -86,7 +86,7 @@ It is recommended to apply our preprocessing function before training/testing on
 ```python
 from arabert.preprocess import ArabertPreprocessor

-model_name="bert-
+model_name="aubmindlab/bert-large-arabertv02"
 arabert_prep = ArabertPreprocessor(model_name=model_name)

 text = "ولن نبالغ إذا قلنا: إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
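The README text touched by the first hunk says that a space is now inserted between numbers and characters and around punctuation characters. The snippet below is a minimal sketch of that idea, not the actual `ArabertPreprocessor` code: `separate_numbers_and_punctuation` is a hypothetical helper, and treating every Unicode "P*" category character as punctuation is an assumption made for illustration. It would, for instance, also split decimal numbers such as `3.5`, which the real preprocessing may handle differently.

```python
import re
import unicodedata


def separate_numbers_and_punctuation(text: str) -> str:
    """Hypothetical helper: pad punctuation and split digit/letter boundaries."""
    # Surround every punctuation character (Unicode category P*) with spaces.
    spaced = "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Insert a space where a digit is directly followed by a letter ...
    spaced = re.sub(r"(\d)([^\W\d_])", r"\1 \2", spaced)
    # ... and where a letter is directly followed by a digit.
    spaced = re.sub(r"([^\W\d_])(\d)", r"\1 \2", spaced)
    # Collapse the extra whitespace introduced above.
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(separate_numbers_and_punctuation("abc123,def"))
    print(separate_numbers_and_punctuation("قلنا: إن"))
```

With input like `"abc123,def"`, the digits and the comma end up as separate whitespace-delimited units, which is the property the README says the new wordpiece vocabulary relies on.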