---
language:
- en
tags:
- bert
- pytorch
- en
- ner
license: "apache-2.0"
---
# BERT for English Named Entity Recognition (bert4ner) Model

English named entity recognition model.

Performance of `bert4ner-base-uncased` on the CoNLL-2003 **test** set:
| | Precision | Recall | F1 |
| ------------ | ------------------ | ------------------ | ------------------ |
| BertSoftmax | 0.8956 | 0.9132 | 0.9043 |
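
F1 here is the harmonic mean of precision and recall: 2 × 0.8956 × 0.9132 / (0.8956 + 0.9132) ≈ 0.9043.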
This is close to SOTA performance on the CoNLL-2003 test set.

BertSoftmax uses the vanilla BERT architecture with a softmax classification layer on top.

This model is released as part of the open-source NER project [nerpy](https://github.com/shibing624/nerpy), which supports bert4ner models and can be called as follows:

#### English NER:
```python
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```
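
Here `split_on_space=True` splits the input on whitespace before tagging, so predictions are word-level, matching the CoNLL-2003 annotation.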
Model files:
```
bert4ner-base-uncased
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```
## Usage (HuggingFace Transformers)
Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:
First, pass your input through the transformer model, then decode the predicted BIOES tags to recover the entity spans.
Install package:
```
pip install torch transformers seqeval
```
```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Align predictions with tokens by dropping the [CLS] and [SEP] positions
    word_tags = [(token, label_list[prediction])
                 for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(word_tags)
    pred_labels = [tag for _, tag in word_tags]
    entities = []
    # get_entities decodes the tag sequence into (type, start, end) spans
    for entity_type, start, end in get_entities(pred_labels):
        # Join the matched tokens (note: word pieces keep their "##" prefixes)
        word = " ".join(tokens[start: end + 1])
        entities.append((word, entity_type))
    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```
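
The span decoding above relies on `get_entities` from seqeval, which turns a BIO/BIOES tag sequence into `(type, start, end)` tuples with inclusive indices. A minimal standalone illustration:

```python
from seqeval.metrics.sequence_labeling import get_entities

# Tags for "EU rejects German call" and "Peter Blackburn"
print(get_entities(["S-ORG", "O", "S-MISC", "O"]))  # [('ORG', 0, 0), ('MISC', 2, 2)]
print(get_entities(["B-PER", "E-PER"]))             # [('PER', 0, 1)]
```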
### Datasets

#### NER datasets

| Dataset | Corpus | Download | Size |
| :------- | :--------- | :---------: | :---------: |
| **`CNER (Chinese NER dataset)`** | CNER (120K characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner) | 1.1MB |
| **`PEOPLE (Chinese NER dataset)`** | People's Daily corpus (2M characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people) | 12.8MB |
| **`CoNLL03 (English NER dataset)`** | CoNLL-2003 corpus (220K words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03) | 1.7MB |
### Input format

The input follows the BIOES tagging scheme: one token and its label per line, separated by whitespace, with sentences separated by a blank line.
```text
EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O

Peter B-PER
Blackburn E-PER
```
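
As an illustration (this helper is not part of nerpy; the function name and file path are hypothetical), data in this format can be parsed into parallel token and label sequences:

```python
def read_bioes(path):
    """Parse a CoNLL-style file: one "token label" pair per line,
    sentences separated by blank lines."""
    sentences, labels = [], []
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line closes the current sentence
                if tokens:
                    sentences.append(tokens)
                    labels.append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split()
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the final sentence if the file lacks a trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels
```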
To train your own bert4ner model, see the examples in [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples).
## Citation
```latex
@software{nerpy,
author = {Xu Ming},
title = {nerpy: Named Entity Recognition toolkit},
year = {2022},
url = {https://github.com/shibing624/nerpy},
}
```