---
language: 
- en
tags:
- bert
- pytorch
- en
- ner
license: "apache-2.0"
---

# BERT for English Named Entity Recognition (bert4ner) Model
English named entity recognition model.

Overall performance of `bert4ner-base-uncased` on the CoNLL-2003 **test** set:

| Model       | Accuracy | Recall | F1     |
| ----------- | -------- | ------ | ------ |
| BertSoftmax | 0.8956   | 0.9132 | 0.9043 |
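
As a quick consistency check on the table above, the reported F1 matches the harmonic mean of the other two columns. The sketch below assumes that the column labeled "Accuracy" holds entity-level precision. The source does not state this, so treat it as an assumption.

```python
# Values copied from the table above.
# Assumption: the "Accuracy" column is entity-level precision.
precision = 0.8956
recall = 0.9132

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9043
```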

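The model reaches near state-of-the-art performance on the CoNLL-2003 test set.

BertSoftmax network structure: vanilla BERT.

This model is open-sourced in the named entity recognition project [nerpy](https://github.com/shibing624/nerpy), which supports bert4ner models. It can be called as follows:

#### English named entity recognition: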

```python
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities:  [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

Model files:
```
bert4ner-base-uncased
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt
```

## Usage (HuggingFace Transformers)
Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:

First, pass the input through the transformer model, then decode the predicted tag sequence to extract the entity words.

Install the required packages:
```
pip install torch transformers seqeval
```

```python
import os

import torch
from seqeval.metrics.sequence_labeling import get_entities
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Tolerate duplicate OpenMP runtimes, which otherwise abort on some setups
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
# The position of each label in this list is its class index in the model output
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    # WordPiece tokens, without the [CLS] and [SEP] special tokens
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    # Highest-scoring label index for every position
    predictions = torch.argmax(outputs, dim=2)
    # [1:-1] drops the [CLS] and [SEP] positions so predictions align with tokens
    word_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(word_tags)

    pred_labels = [tag for _, tag in word_tags]
    entities = []
    # get_entities yields (entity_type, start, end) with an inclusive end index
    for entity_type, start, end in get_entities(pred_labels):
        entities.append((tokens[start: end + 1], entity_type))

    print("Sentence entity:")
    print(entities)
    return entities


get_entity(sentence)
```
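
The example above reports each entity as a list of WordPiece tokens. The following is a minimal sketch, not part of nerpy or the original example, of two optional post-processing steps: gluing `##` continuation pieces back onto the preceding piece, and decoding a BIOES tag sequence into spans without seqeval. The helper names `merge_wordpieces` and `bioes_spans` are hypothetical. The decoder is deliberately strict and drops malformed sequences, so it may not match the output of seqeval's `get_entities` on noisy predictions.

```python
from typing import List, Optional, Tuple


def merge_wordpieces(pieces: List[str]) -> str:
    """Join WordPiece tokens, gluing '##' continuations onto the previous piece."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    # Pieces are joined with single spaces; original punctuation spacing is not restored.
    return " ".join(words)


def bioes_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans with an inclusive end index."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            spans.append((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token entity
            start, current = i, etype
        elif prefix == "I" and current == etype:
            # Continue the open entity
            continue
        elif prefix == "E" and current == etype and start is not None:
            # Close the open entity
            spans.append((etype, start, i))
            start, current = None, None
        else:
            # "O" or a malformed transition: discard any open entity
            start, current = None, None
    return spans


print(merge_wordpieces(["black", "##burn"]))
print(bioes_spans(["B-PER", "E-PER", "O", "S-LOC"]))
```

Each span's `start` and `end` can index into the token list, and the resulting slice can be passed to `merge_wordpieces` to obtain a readable string.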


### Datasets

#### Named entity recognition datasets


| Dataset | Corpus | Download link | File size |
| :------- | :--------- | :---------: | :---------: |
| **`CNER Chinese NER dataset`** | CNER (120k characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily dataset (2M characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |
| **`CoNLL03 English NER dataset`** | CoNLL-2003 dataset (220k words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03)| 1.7MB |


### Input format

Each line holds one token followed by its label, with the BIOES tagging scheme preferred. Sentences are separated by a blank line.

```text
EU	S-ORG
rejects	O
German	S-MISC
call	O
to	O
boycott	O
British	S-MISC
lamb	O
.	O

Peter	B-PER
Blackburn	E-PER
```
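
A file in this format can be read with a few lines of Python. The following is a minimal sketch under stated assumptions, not nerpy's own loader: the function name `read_tagged_sentences` is hypothetical, and it assumes the token and label are separated by whitespace (a tab in the example above) and that tokens contain no internal whitespace.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_tagged_sentences(text: str) -> List[Sentence]:
    """Parse 'token<whitespace>label' lines; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'token label', got {line!r}")
        token, label = parts
        current.append((token, label))
    # Keep the last sentence when the text does not end with a blank line
    if current:
        sentences.append(current)
    return sentences


sample = "EU\tS-ORG\nrejects\tO\n\nPeter\tB-PER\nBlackburn\tE-PER\n"
for sentence in read_tagged_sentences(sample):
    print(sentence)
```

To load a real file, pass its contents, for example `read_tagged_sentences(open(path, encoding="utf-8").read())`, where `path` is a placeholder for the data file location.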


To train bert4ner, see [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples).


## Citation

```latex
@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
```