update auto tokenizer support
- README.md +12 -9
- config.json +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
README.md
CHANGED
@@ -1,5 +1,8 @@
 ---
 language: zh
+pipeline_tag: fill-mask
+widget:
+- text: "今天[MASK]情很好"
 ---
 
 # albert_chinese_large
@@ -7,25 +10,25 @@ language: zh
 This a albert_chinese_large model from [Google's github](https://github.com/google-research/ALBERT)
 converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
 
-##
+## Notice
+*Support AutoTokenizer*
 
-Since sentencepiece is not used in
+Since sentencepiece is not used in albert_chinese_base model
 you have to call BertTokenizer instead of AlbertTokenizer !!!
 we can eval it using an example on MaskedLM
 
-由於
+由於 albert_chinese_base 模型沒有用 sentencepiece
 用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
 我們可以跑MaskedLM預測來驗證這個做法是否正確
 
 ## Justify (驗證有效性)
-[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
 ```python
-from transformers import
+from transformers import AutoTokenizer, AlbertForMaskedLM
 import torch
 from torch.nn.functional import softmax
 
 pretrained = 'voidful/albert_chinese_large'
-tokenizer =
+tokenizer = AutoTokenizer.from_pretrained(pretrained)
 model = AlbertForMaskedLM.from_pretrained(pretrained)
 
 inputtext = "今天[MASK]情很好"
@@ -33,11 +36,11 @@ inputtext = "今天[MASK]情很好"
 maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
 
 input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
-outputs = model(input_ids,
+outputs = model(input_ids, labels=input_ids)
 loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
+logit_prob = softmax(prediction_scores[0, maskpos],dim=-1).data.tolist()
 predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
 predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token,logit_prob[predicted_index])
+print(predicted_token, logit_prob[predicted_index])
 ```
 Result: `心 0.9422469735145569`
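Note on the README example: it finds the masked position by searching for the hard-coded id 103, then takes a softmax over the vocabulary axis at that position. The following is a minimal, dependency-free sketch of that step only. The toy vocabulary, every id other than 103, and the scores are made up for illustration and do not come from the model.

```python
import math

# Hypothetical toy vocabulary; only the id 103 for [MASK] mirrors the README snippet.
vocab = {"[CLS]": 101, "[SEP]": 102, "[MASK]": 103, "今": 1, "天": 2, "心": 3, "情": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Encoded form of "[CLS] 今 天 [MASK] 情 [SEP]" under the toy vocabulary.
input_ids = [101, 1, 2, 103, 4, 102]

# Look the mask id up by name instead of hard-coding 103.
maskpos = input_ids.index(vocab["[MASK]"])

# Made-up scores at the masked position: candidate token id -> score.
mask_scores = {1: 0.2, 2: 0.1, 3: 5.0, 4: 0.4}
candidate_ids = list(mask_scores)
probs = softmax([mask_scores[i] for i in candidate_ids])

# Pick the highest-probability candidate, as torch.argmax does in the README.
best = max(range(len(probs)), key=probs.__getitem__)
predicted_token = id_to_token[candidate_ids[best]]
print(maskpos, predicted_token, round(probs[best], 4))
```

With the real tokenizer, `tokenizer.mask_token_id` plays the role of the dictionary lookup here, which avoids depending on the literal 103.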
config.json
CHANGED
@@ -1,4 +1,7 @@
 {
+"architectures": [
+"AlbertForMaskedLM"
+],
 "attention_probs_dropout_prob": 0,
 "bos_token_id": 2,
 "classifier_dropout_prob": 0.1,
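Note on `architectures`: as far as I understand, loaders that are not told which model class to use can consult this field to choose one. The sketch below only shows reading the field from an abbreviated copy of the patched config. The helper name `first_architecture` is hypothetical and is not a transformers API.

```python
import json

# Abbreviated copy of the patched config.json; only keys visible in the diff are included.
CONFIG_TEXT = """
{
  "architectures": ["AlbertForMaskedLM"],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1
}
"""


def first_architecture(config_text):
    """Hypothetical helper: return the first declared architecture, or None."""
    architectures = json.loads(config_text).get("architectures") or []
    return architectures[0] if architectures else None


declared = first_architecture(CONFIG_TEXT)
undeclared = first_architecture('{"bos_token_id": 2}')  # config without the field
print(declared, undeclared)
```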
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
+{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "voidful/albert_chinese_large", "tokenizer_class": "BertTokenizer"}
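Note on `tokenizer_class`: this key appears to be what lets `AutoTokenizer` return a `BertTokenizer` for an ALBERT checkpoint, rather than the tokenizer normally paired with the model type. The sketch below is a simplified illustration of that lookup, not the library's actual resolution code. The helper `resolve_tokenizer_class` and its fallback default are assumptions made for the example.

```python
import json

# Abbreviated copy of the tokenizer_config.json added in this commit.
TOKENIZER_CONFIG = (
    '{"do_lower_case": true, "mask_token": "[MASK]", '
    '"name_or_path": "voidful/albert_chinese_large", '
    '"tokenizer_class": "BertTokenizer"}'
)


def resolve_tokenizer_class(config_text, default="AlbertTokenizer"):
    """Hypothetical helper: prefer the explicit tokenizer_class, else fall back.

    The default stands in for the tokenizer normally associated with the
    model type; the real lookup in transformers is more involved.
    """
    config = json.loads(config_text)
    return config.get("tokenizer_class") or default


chosen = resolve_tokenizer_class(TOKENIZER_CONFIG)
fallback = resolve_tokenizer_class("{}")  # no tokenizer_config entry at all
print(chosen, fallback)
```

Under this reading, the pre-commit repository corresponds to the fallback case, which is why the old README had to tell users to call BertTokenizer explicitly.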