File size: 4,319 Bytes
ea0bf6d 853c7fd 3cfe78a 344a656 3cfe78a 344a656 853c7fd ea0bf6d 344a656 3cfe78a 344a656 3cfe78a 344a656 a0011f0 344a656 90c27a7 344a656 90c27a7 344a656 90c27a7 344a656 90c27a7 344a656 90c27a7 344a656 853c7fd |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 |
---
language:
- en
tags:
- bert
- pytorch
- en
- ner
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
widget:
- text: AL-AIN, United Arab Emirates 1996-12-06
---
# BERT for English Named Entity Recognition(bert4ner) Model
英文实体识别模型
`bert4ner-base-uncased` evaluate CoNLL-2003 test data:
The overall performance of BERT on CoNLL-2003 **test**:
| | Accuracy | Recall | F1 |
| ------------ | ------------------ | ------------------ | ------------------ |
| BertSoftmax | 0.8956 | 0.9132 | 0.9043 |
在CoNLL-2003的测试集上达到接近SOTA水平。
BertSoftmax的网络结构(原生BERT)。
本项目开源在实体识别项目:[nerpy](https://github.com/shibing624/nerpy),可支持bert4ner模型,通过如下命令调用:
#### 英文实体识别:
```shell
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```
模型文件组成:
```
bert4ner-base-uncased
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```
## Usage (HuggingFace Transformers)
Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:
First, you pass your input through the transformer model, then you have to apply the bio tag to get the entity words.
Install package:
```
pip install transformers seqeval
```
```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
"E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]
sentence = "AL-AIN, United Arab Emirates 1996-12-06"
def get_entity(sentence):
tokens = tokenizer.tokenize(sentence)
inputs = tokenizer.encode(sentence, return_tensors="pt")
with torch.no_grad():
outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)
word_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
print(sentence)
print(word_tags)
pred_labels = [i[1] for i in word_tags]
entities = []
line_entities = get_entities(pred_labels)
for i in line_entities:
word = tokens[i[1]: i[2] + 1]
entity_type = i[0]
entities.append((word, entity_type))
print("Sentence entity:")
print(entities)
get_entity(sentence)
```
### 数据集
#### 实体识别数据集
| 数据集 | 语料 | 下载链接 | 文件大小 |
| :------- | :--------- | :---------: | :---------: |
| **`CNER中文实体识别数据集`** | CNER(12万字) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
| **`PEOPLE中文实体识别数据集`** | 人民日报数据集(200万字) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |
| **`CoNLL03英文实体识别数据集`** | CoNLL-2003数据集(22万字) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03)| 1.7MB |
### input format
Input format (prefer BIOES tag scheme), with each character its label for one line. Sentences are splited with a null line.
```text
EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O
Peter B-PER
Blackburn E-PER
```
如果需要训练bert4ner,请参考[https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples)
## Citation
```latex
@software{nerpy,
author = {Xu Ming},
title = {nerpy: Named Entity Recognition toolkit},
year = {2022},
url = {https://github.com/shibing624/nerpy},
}
``` |