---
language:
- en
tags:
- bert
- pytorch
- en
- ner
license: "apache-2.0"
---

# BERT for English Named Entity Recognition (bert4ner) Model
English named entity recognition model

`bert4ner-base-uncased` evaluated on the CoNLL-2003 test data:

The overall performance of BERT on CoNLL-2003 **test**:

|             | Precision | Recall | F1     |
| ----------- | --------- | ------ | ------ |
| BertSoftmax | 0.8956    | 0.9132 | 0.9043 |

It achieves near state-of-the-art performance on the CoNLL-2003 test set.

BertSoftmax uses the vanilla BERT network architecture.

This model is released in the open-source NER project [nerpy](https://github.com/shibing624/nerpy), which supports bert4ner models. Call it as follows:

#### English entity recognition:

```shell
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

Model files:
```
bert4ner-base-uncased
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

## Usage (HuggingFace Transformers)
Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:

First, pass your input through the transformer model; then decode the predicted BIOES tags to recover the entity words.

Install packages:
```
pip install transformers seqeval
```

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Skip the predictions for [CLS] and [SEP] so labels align with tokens.
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        # get_entities yields (type, start, end) spans over token positions.
        word = tokenizer.convert_tokens_to_string(tokens[i[1]: i[2] + 1])
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```
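
Entity-level precision/recall/F1 (as in the table above) can be computed with seqeval, which understands the BIOES scheme used here. A minimal sketch with illustrative, hand-written label sequences rather than real model output:

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative gold and predicted BIOES label sequences, one inner list per sentence.
y_true = [["S-ORG", "O", "S-MISC", "O", "O", "O", "S-MISC", "O", "O"]]
y_pred = [["S-ORG", "O", "S-MISC", "O", "O", "O", "O", "O", "O"]]

print(f1_score(y_true, y_pred))  # entity-level micro-averaged F1
print(classification_report(y_true, y_pred))
```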


### Datasets

#### NER datasets

| Dataset | Corpus | Download link | File size |
| :------ | :----- | :-----------: | :-------: |
| **`CNER Chinese NER dataset`** | CNER (120K chars) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner) | 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily news corpus (2M chars) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people) | 12.8MB |
| **`CoNLL03 English NER dataset`** | CoNLL-2003 corpus (220K words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03) | 1.7MB |


### Input format

Input format (BIOES tag scheme preferred): one token and its label per line, separated by a space. Sentences are separated by a blank line. A sample, with a loader sketch after it:

```text
EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O

Peter B-PER
Blackburn E-PER
```
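
A minimal sketch of a loader for this format; `read_conll` is an illustrative helper, not part of nerpy:

```python
def read_conll(path):
    """Read two-column token/label data; blank lines separate sentences."""
    sentences, tokens, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:  # a blank line closes the current sentence
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            tokens.append(parts[0])
            labels.append(parts[-1])
    if tokens:  # flush a final sentence with no trailing blank line
        sentences.append((tokens, labels))
    return sentences
```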


To train your own bert4ner model, please refer to [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples)


## Citation

```latex
@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
```