---
license: mit
language:
- fi
metrics:
- f1
- precision
- recall
library_name: transformers
pipeline_tag: token-classification
base_model:
- TurkuNLP/bert-base-finnish-cased-v1
---
## Finnish named entity recognition
The model performs named entity recognition from text input in Finnish.
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
using 10 named entity categories. The training data includes, for instance, the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one),
the Finnish part of the [NewsEye dataset](https://zenodo.org/record/4573313),
and an annotated dataset of Finnish documents from the 1970s onwards, digitized by the National Archives of Finland.
Since the latter dataset also contains sensitive data, it has not been made publicly available.
An example of how the model can be used for named entity recognition is provided in this [Colab notebook](https://colab.research.google.com/drive/1-koUCwz4aU_UvyZxSW-Awwf5RQftFk5m).
The motivation behind the model's development, as well as the data selection and annotation processes, are described in more detail in the article [Making sense of bureaucratic documents – Named entity
recognition for state authority archives](https://library.imaging.org/archiving/articles/21/1/2).
## Intended uses & limitations
The model has been trained to recognize the following named entities from a text in Finnish:
- PERSON (person names)
- ORG (organizations)
- LOC (locations)
- GPE (geopolitical locations)
- PRODUCT (products)
- EVENT (events)
- DATE (dates)
- JON (Finnish journal numbers, *diaarinumero*)
- FIBC (Finnish business identity codes, *y-tunnus*)
- NORP (nationality, religious and political groups)
Some entities, such as EVENT and LOC, are less common in the training data than the others, so
recognition accuracy for these classes tends to be lower.
In addition, most of the training data is relatively recent, so the model may struggle when the input
contains, for example, older names or writing styles.
## How to use
The easiest way to use the model is by utilizing the Transformers pipeline for token classification:
```python
from transformers import pipeline

model_checkpoint = "Kansallisarkisto/finbert-ner"

# aggregation_strategy="simple" merges subword tokens into whole-word entity spans
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

predictions = token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
print(predictions)
```
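Each prediction is a dictionary describing one detected entity span. As a minimal illustration of working with the output (the 0.8 threshold is an arbitrary example value, not a recommendation from the model authors), the predictions can be filtered by confidence score:
```python
for pred in predictions:
    # With aggregation_strategy="simple", each dict has the keys
    # 'entity_group', 'score', 'word', 'start' and 'end'.
    if pred["score"] >= 0.8:  # arbitrary example threshold
        print(f"{pred['entity_group']}: {pred['word']} ({pred['score']:.2f})")
```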
## Training data
Some of the entity types (for instance WORK_OF_ART, LAW and MONEY) that are annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
were filtered out from the data used for training the model. Conversely, entity types that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
were added to it during the annotation process. The data sources used in model training, validation and testing are listed below:
Dataset|Period covered by the texts|Text type|Percentage of the total data
-|-|-|-
[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts|23%
[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd digitized newspaper articles|3%
Diverse document data from Finnish public administration|1970s-2000s|OCR'd digitized documents|69%
Finnish senate documents|1916|Partly manually transcribed, partly HTR'd digitized documents|3%
Finnish books from [Project Gutenberg](https://www.gutenberg.org)|Early 20th century|OCR'd texts|1%
Theses from Finnish polytechnic universities|2000s|OCR'd texts|1%
The number of entities belonging to each entity class in the training, validation and test datasets is listed below:
### Number of entities per class in the data
Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
-|-|-|-|-|-|-|-|-|-|-
Train|20211|45722|1321|19387|9571|1616|23642|2460|2384|2529
Val|2525|5517|130|2512|1217|240|3047|306|247|283
Test|2414|5577|179|2445|1097|183|2838|272|374|356
## Training procedure
This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:
- learning rate: 2e-05
- train batch size: 24
- epochs: 10
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- scheduler: linear scheduler with num_warmup_steps=round(len(train_dataloader)/5) and num_training_steps=len(train_dataloader)*epochs
- maximum length of data sequence: 512
- patience: 2 epochs
- classifier dropout: 0.3
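The optimizer and scheduler settings above correspond, roughly, to the following PyTorch/Transformers setup. This is a minimal sketch that assumes `model` and `train_dataloader` already exist; the project's actual training script is linked below.
```python
import torch
from transformers import get_linear_schedule_with_warmup

epochs = 10

# AdamW with the betas and epsilon listed above; `model` is assumed
# to be a BERT token-classification model being fine-tuned.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-05, betas=(0.9, 0.999), eps=1e-08
)

# Linear decay with warmup over the first fifth of one epoch's steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=round(len(train_dataloader) / 5),
    num_training_steps=len(train_dataloader) * epochs,
)
```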
In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
so that the tokenized chunks would not exceed the model's maximum sequence length of 512. Tokenization was performed
with the tokenizer of the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
model.
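A minimal sketch of this kind of chunking, assuming whitespace-delimited words and greedy packing with the model's own tokenizer (the exact preprocessing used for training may differ):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def split_into_chunks(text, max_tokens=300):
    """Greedily pack whitespace-separated words into chunks whose
    subword token count stays at or below max_tokens."""
    chunks, current, current_len = [], [], 0
    for word in text.split():
        n_subwords = len(tokenizer.tokenize(word))
        # Start a new chunk if adding this word would exceed the limit.
        if current and current_len + n_subwords > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += n_subwords
    if current:
        chunks.append(" ".join(current))
    return chunks
```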
The training code, with instructions, is available on [GitHub](https://github.com/DALAI-project/Train_BERT_NER).
## Evaluation results
Evaluation results using the test dataset are listed below:
Entity|Precision|Recall|F1-score
-|-|-|-
PERSON|0.90|0.91|0.90
ORG|0.84|0.87|0.86
LOC|0.84|0.86|0.85
GPE|0.91|0.91|0.91
PRODUCT|0.73|0.77|0.75
EVENT|0.69|0.73|0.71
DATE|0.90|0.92|0.91
JON|0.83|0.95|0.89
FIBC|0.95|0.99|0.97
NORP|0.91|0.95|0.93
The metrics were calculated using the [seqeval](https://github.com/chakki-works/seqeval) library.
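For illustration, entity-level metrics of this kind can be computed from BIO-tagged label sequences with seqeval; the tag sequences below are made-up toy data, not the actual test set:
```python
from seqeval.metrics import classification_report

# Toy gold and predicted BIO tag sequences (one inner list per sentence).
y_true = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]

# Prints entity-level precision, recall and F1-score per class.
print(classification_report(y_true, y_pred))
```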
## Acknowledgements
The model was developed in the ERDF-funded project "Using Artificial Intelligence to Improve the Quality and Usability of Digital Records"
(Dalai) in 2021-2023. The purpose of the project was to use artificial intelligence to automate the digitisation of cultural heritage
materials and their description. The main target group comprises memory organisations, archives, museums and libraries that digitise
and provide digital materials to their customers, as well as companies that develop services related to digitisation and the processing
of digital materials.
Project partners were the National Archives of Finland, Central Archives for Finnish Business Records (Elka),
South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd.
The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process were
carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).