---
language:
- en
tags:
- bert
- pytorch
- en
- ner
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
widget:
- text: AL-AIN, United Arab Emirates 1996-12-06
---

# BERT for English Named Entity Recognition (bert4ner) Model

An English named entity recognition model.

Evaluation of `bert4ner-base-uncased` on the CoNLL-2003 **test** set:

| Model | Precision | Recall | F1 |
| ------------ | ------------------ | ------------------ | ------------------ |
| BertSoftmax | 0.8956 | 0.9132 | 0.9043 |
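
The F1 score in the table is the harmonic mean of its first two columns, which is how F1 is defined from precision and recall. The short check below reproduces the reported value; the helper name `f1_score` is only illustrative and is not taken from nerpy.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the BertSoftmax row of the table
print(round(f1_score(0.8956, 0.9132), 4))  # 0.9043
```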
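
The model reaches near state-of-the-art performance on the CoNLL-2003 test set.

BertSoftmax uses the original (vanilla) BERT network structure.

This model is open-sourced as part of the NER project [nerpy](https://github.com/shibing624/nerpy), which supports bert4ner models. It can be called as follows:

#### English named entity recognition: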

```python
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

Model files:

```
bert4ner-base-uncased
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

## Usage (HuggingFace Transformers)

Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:

First, pass the input through the transformer model. Then decode the predicted BIOES tags to recover the entity words.
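
To make the second step concrete, the sketch below decodes a list of BIOES tags into `(entity_type, start, end)` spans in plain Python. The function name `decode_bioes` and the sample tags are illustrative assumptions, not part of nerpy or seqeval. This decoder is strict and drops malformed sequences, which may differ from how a library decoder treats them.

```python
from typing import List, Optional, Tuple


def decode_bioes(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans, end inclusive, from BIOES tags."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None    # index where the open span began
    current: Optional[str] = None  # entity type of the open span
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            spans.append((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token entity
            start, current = i, etype
        elif prefix in ("I", "E") and start is not None and current == etype:
            if prefix == "E":
                # Close the open entity
                spans.append((etype, start, i))
                start, current = None, None
        else:
            # "O" or an inconsistent tag: discard any open span
            start, current = None, None
    return spans


# Hypothetical tag sequence, not actual model output
example_tags = ["S-LOC", "O", "B-LOC", "I-LOC", "E-LOC", "O"]
print(decode_bioes(example_tags))  # [('LOC', 0, 0), ('LOC', 2, 4)]
```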

Install the required packages:

```
pip install torch transformers seqeval
```

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

# Allow duplicate OpenMP runtimes (avoids an OMP initialization error on some setups)
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Skip the predictions for [CLS] and [SEP] so they line up with `tokens`
    word_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(word_tags)

    pred_labels = [i[1] for i in word_tags]
    entities = []
    # get_entities returns (entity_type, start_index, end_index) chunks
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = tokens[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```
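
The entity words collected by `get_entity` are lists of WordPiece tokens. A minimal sketch of joining such pieces back into readable text follows. `wordpieces_to_text` is a hypothetical helper, and the sample pieces are made up for illustration rather than taken from the model output.

```python
from typing import List


def wordpieces_to_text(pieces: List[str]) -> str:
    """Join WordPiece tokens, attaching '##' continuation pieces to the previous piece."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            # Continuation piece: append without a space
            text += piece[2:]
        elif text:
            # New word: separate with a single space
            text += " " + piece
        else:
            text = piece
    return text


# Hypothetical WordPiece sequences
print(wordpieces_to_text(["united", "arab", "emirates"]))  # united arab emirates
print(wordpieces_to_text(["black", "##burn"]))             # blackburn
```

With a loaded tokenizer, `tokenizer.convert_tokens_to_string(word)` should give a similar result. Neither approach restores the original casing, since the model is uncased, and punctuation that the tokenizer split off stays space-separated.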

### Datasets

#### NER datasets

| Dataset | Corpus | Download link | File size |
| :------- | :--------- | :---------: | :---------: |
| **`CNER Chinese NER dataset`** | CNER (120k characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily dataset (2M characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |
| **`CoNLL03 English NER dataset`** | CoNLL-2003 dataset (220k words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03)| 1.7MB |

### Input format

The input format uses the BIOES tagging scheme (preferred), with one token and its label per line. Sentences are separated by a blank line.

```text
EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O

Peter B-PER
Blackburn E-PER
```
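
As an illustration of this layout, the sketch below parses such text into sentences of `(token, tag)` pairs. It assumes the token and label are separated by whitespace, as in the sample above. `parse_tagged_text` is a hypothetical helper, not nerpy's own data loader.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_tagged_text(text: str) -> List[Sentence]:
    """Split 'token label' lines into sentences; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample = "EU S-ORG\nrejects O\n\nPeter B-PER\nBlackburn E-PER\n"
for sentence in parse_tagged_text(sample):
    print(sentence)
```

A line with only one field would raise a `ValueError` here; a real loader would likely want to handle that case explicitly.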

To train bert4ner, refer to the examples at [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples).

## Citation

```latex
@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
```