Updated the README and the model
Browse files
- README.md: +35 -17
- pytorch_model.bin: +1 -1
README.md
CHANGED
````diff
@@ -24,23 +24,41 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
 >>> unmasker = pipeline('fill-mask', model='cahya/distilbert-base-indonesian')
->>> unmasker("
-[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
-  'score': 0.7983310222625732,
-  'token': 1495},
- {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
-  'score': 0.090003103017807,
-  'token': 17},
- {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
-  'score': 0.025469014421105385,
-  'token': 1600},
- {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
-  'score': 0.017966199666261673,
-  'token': 1555},
- {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
-  'score': 0.016971781849861145,
-  'token': 1572}]
+>>> unmasker("Ayahku sedang bekerja di sawah untuk [MASK] padi")
+
+[
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk menanam padi [SEP]",
+    "score": 0.6853187084197998,
+    "token": 12712,
+    "token_str": "menanam"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk bertani padi [SEP]",
+    "score": 0.03739545866847038,
+    "token": 15484,
+    "token_str": "bertani"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk memetik padi [SEP]",
+    "score": 0.02742469497025013,
+    "token": 30338,
+    "token_str": "memetik"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk penggilingan padi [SEP]",
+    "score": 0.02214187942445278,
+    "token": 28252,
+    "token_str": "penggilingan"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk tanam padi [SEP]",
+    "score": 0.0185895636677742,
+    "token": 11308,
+    "token_str": "tanam"
+  }
+]
 ```
 Here is how to use this model to get the features of a given text in PyTorch:
 ```python
@@ -67,7 +85,7 @@ output = model(encoded_input)
 
 ## Training data
 
-This model was
+This model was distilled with 522MB of Indonesian Wikipedia and 1GB of
 [indonesian newspapers](https://huggingface.co/datasets/id_newspapers_2018).
 The texts are lowercased and tokenized using WordPiece and a vocabulary size of 32,000. The inputs of the model are
 then of the form:
````
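The new fill-mask example returns a list of candidate fills for the `[MASK]` token, ordered by score. As a quick sanity check on that output structure, here is a minimal sketch that uses only the values shown in the diff above (no model download required):

```python
# Candidate fills copied from the new fill-mask example in this commit.
predictions = [
    {"token_str": "menanam", "score": 0.6853187084197998},
    {"token_str": "bertani", "score": 0.03739545866847038},
    {"token_str": "memetik", "score": 0.02742469497025013},
    {"token_str": "penggilingan", "score": 0.02214187942445278},
    {"token_str": "tanam", "score": 0.0185895636677742},
]

scores = [p["score"] for p in predictions]
# The pipeline returns candidates ordered by descending score.
assert scores == sorted(scores, reverse=True)
# Scores are softmax probabilities over the vocabulary, so any subset sums to < 1.
assert sum(scores) < 1.0
print(predictions[0]["token_str"])  # -> menanam
```

Note how the top candidate ("menanam", to plant) dominates with ~0.69, while the remaining four together account for only ~0.11 of the probability mass.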
pytorch_model.bin
CHANGED
````diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:39b114f8d3260960d4a3a28c2b1ba0543e4ec09a96342d88747f1bed1cd9ab0e
 size 272513919
````
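The `pytorch_model.bin` entry above is not the model weights themselves but a Git LFS pointer file: three `key value` lines (`version`, `oid`, `size`) as defined by the spec the pointer itself links to. A minimal parser, sketched against the new pointer contents from this commit:

```python
# Minimal parser for a Git LFS pointer file (spec: https://git-lfs.github.com/spec/v1).
# The pointer text below is the new pytorch_model.bin contents from this commit.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:39b114f8d3260960d4a3a28c2b1ba0543e4ec09a96342d88747f1bed1cd9ab0e
size 272513919
"""

def parse_lfs_pointer(text: str) -> dict:
    # Each line is "key value"; split on the first space only.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

info = parse_lfs_pointer(pointer_text)
assert info["oid_algo"] == "sha256"
assert len(info["oid"]) == 64     # hex-encoded SHA-256 digest
assert info["size"] == 272513919  # pointer size field, unchanged in this commit
```

Since the `size` field is identical on both sides of the diff, this commit replaced the weights with a file of exactly the same byte length; only the SHA-256 oid changed.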