metrics:
- accuracy
library_name: transformers
pipeline_tag: token-classification
---

## Finnish named entity recognition

The model performs named entity recognition from text input in Finnish.
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
using 10 named entity categories. The training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
Since the latter dataset also contains sensitive data, it has not been made publicly available.

## Intended uses & limitations

The model has been trained to recognize the following named entities from Finnish text:

- PERSON (person names)
- ORG (organizations)
- LOC (locations)
- GPE (geopolitical locations)
- PRODUCT (products)
- EVENT (events)
- DATE (dates)
- JON (Finnish journal numbers (diaarinumero))
- FIBC (Finnish business identity codes (y-tunnus))
- NORP (nationality, religious and political groups)

Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that
recognition accuracy for these entities also tends to be lower.

The training data is relatively recent, so the model may have difficulties when the input
contains, for example, old names or writing styles.

## How to use

The easiest way to use the model is with the Transformers pipeline for token classification:

```python
from transformers import pipeline

# Load the fine-tuned NER model from the Hugging Face Hub
model_checkpoint = "Kansallisarkisto/finbert-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
```
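
With `aggregation_strategy="simple"`, the pipeline returns one dictionary per recognized entity span, with fields such as `entity_group`, `score`, `word`, `start` and `end`. Continuing from the snippet above, a minimal post-processing sketch could look like the following; the variable names are illustrative, and it assumes the predicted `entity_group` values match the category names listed earlier:

```python
# Illustrative post-processing of the pipeline output (not part of the original card)
results = token_classifier(
    "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."
)

for entity in results:
    # Each aggregated prediction carries the entity type, confidence score and surface form
    print(f"{entity['entity_group']:8s} {entity['score']:.3f} {entity['word']}")

# For example, keep only the geopolitical locations (assuming a GPE label name)
places = [e["word"] for e in results if e["entity_group"] == "GPE"]
```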

## Training data

Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
dataset were filtered out from the dataset used for training the model; a sketch of this kind of filtering step is shown after the table below. In addition to this dataset, OCR'd and annotated content of
digitized documents from Finnish public administration was also used for model training. The number of entities belonging to the different
entity classes contained in the training, validation and test datasets is listed below:

Number of entity types in the data

Dataset|O|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
-|-|-|-|-|-|-|-|-|-|-|-
Train|0|0|0|0|0|0|0|0|0|0|0
Val|0|0|0|0|0|0|0|0|0|0|0
Test|0|0|0|0|0|0|0|0|0|0|0
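
As noted above, entity classes such as WORK_OF_ART, LAW and MONEY were filtered out of the OntoNotes annotations before training. A minimal sketch of that kind of filtering step over CoNLL-style data is shown below; the file name, column format and helper function are hypothetical illustrations, not taken from the actual preprocessing code.

```python
# Hypothetical sketch: drop filtered-out entity classes from CoNLL-style token/tag pairs;
# the file name, column format and kept label set are assumptions, not the actual code.
KEPT = {"PERSON", "ORG", "LOC", "GPE", "PRODUCT", "EVENT", "DATE", "JON", "FIBC", "NORP"}

def filter_labels(lines):
    """Replace tags of filtered-out classes (e.g. B-MONEY) with 'O'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # empty line = sentence boundary
            yield ""
            continue
        token, tag = line.split("\t")
        if tag != "O" and tag.split("-", 1)[1] not in KEPT:
            tag = "O"
        yield f"{token}\t{tag}"

# Hypothetical usage with a tab-separated token/tag file
with open("turku_one_train.tsv", encoding="utf-8") as f:
    filtered = list(filter_labels(f))
```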

## Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:

The training code with instructions is available [here](https://github.com/DALAI-hanke/BERT_NER).
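
The hyperparameter values themselves are not listed in this card; the linked repository contains the actual training code. Purely as an illustration of what fine-tuning a BERT token classifier with the Transformers `Trainer` API looks like, the following sketch uses placeholder values throughout, not the configuration used for this model:

```python
# Illustrative sketch only: every value below is a placeholder,
# not a hyperparameter actually used to train Kansallisarkisto/finbert-ner.
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments

base_checkpoint = "TurkuNLP/bert-base-finnish-cased-v1"
num_labels = 21  # e.g. O plus B-/I- tags for the 10 entity classes

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(base_checkpoint, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="finbert-ner",        # hypothetical output directory
    learning_rate=2e-5,              # placeholder
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,              # placeholder
    weight_decay=0.01,               # placeholder
)
# These arguments would then be passed to a Trainer together with the
# tokenized and label-aligned training and validation datasets.
```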