---
language:
- en
tags:
- bert
- pytorch
- en
- ner
license: "apache-2.0"
---

# BERT for English Named Entity Recognition (bert4ner) Model
English named entity recognition model

`bert4ner-base-uncased` evaluated on the CoNLL-2003 test data:

The overall performance of BERT on CoNLL-2003 **test**:

|             | Precision | Recall | F1     |
| ----------- | --------- | ------ | ------ |
| BertSoftmax | 0.8956    | 0.9132 | 0.9043 |

It achieves near state-of-the-art performance on the CoNLL-2003 test set.

BertSoftmax uses the vanilla BERT network architecture.

This model is released in the open-source NER project [nerpy](https://github.com/shibing624/nerpy), which supports bert4ner models. Call it as follows:

#### English entity recognition:

```shell
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

Model files:
```
bert4ner-base-uncased
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

## Usage (HuggingFace Transformers)
Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:

First, pass your input through the transformer model; then decode the predicted BIOES tags to recover the entity words.

Install packages:
```
pip install transformers seqeval
```

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Skip the predictions for [CLS] and [SEP] so labels align with tokens.
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        # get_entities yields (type, start, end) spans over token positions.
        word = tokenizer.convert_tokens_to_string(tokens[i[1]: i[2] + 1])
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```
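
Entity-level precision/recall/F1 (as in the table above) can be computed with seqeval, which understands the BIOES scheme used here. A minimal sketch with illustrative, hand-written label sequences rather than real model output:

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative gold and predicted BIOES label sequences, one inner list per sentence.
y_true = [["S-ORG", "O", "S-MISC", "O", "O", "O", "S-MISC", "O", "O"]]
y_pred = [["S-ORG", "O", "S-MISC", "O", "O", "O", "O", "O", "O"]]

print(f1_score(y_true, y_pred))  # entity-level micro-averaged F1
print(classification_report(y_true, y_pred))
```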


### Datasets

#### NER datasets

| Dataset | Corpus | Download link | File size |
| :------ | :----- | :-----------: | :-------: |
| **`CNER Chinese NER dataset`** | CNER (120K chars) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner) | 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily news corpus (2M chars) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people) | 12.8MB |
| **`CoNLL03 English NER dataset`** | CoNLL-2003 corpus (220K words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03) | 1.7MB |


### Input format

Input format (BIOES tag scheme preferred): one token and its label per line, separated by a space. Sentences are separated by a blank line. A sample, with a loader sketch after it:

```text
EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O

Peter B-PER
Blackburn E-PER
```
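
A minimal sketch of a loader for this format; `read_conll` is an illustrative helper, not part of nerpy:

```python
def read_conll(path):
    """Read two-column token/label data; blank lines separate sentences."""
    sentences, tokens, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:  # a blank line closes the current sentence
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            tokens.append(parts[0])
            labels.append(parts[-1])
    if tokens:  # flush a final sentence with no trailing blank line
        sentences.append((tokens, labels))
    return sentences
```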


To train your own bert4ner model, please refer to [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples)


## Citation

```latex
@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
```