metrics:
- accuracy
library_name: transformers
pipeline_tag: token-classification
---

## Finnish named entity recognition

The model performs named entity recognition from text input in Finnish.
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
using 10 named entity categories. The training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
Since the latter dataset also contains sensitive data, it has not been made publicly available.

## Intended uses & limitations

The model has been trained to recognize the following named entities from Finnish text:

- PERSON (person names)
- ORG (organizations)
- LOC (locations)
- GPE (geopolitical locations)
- PRODUCT (products)
- EVENT (events)
- DATE (dates)
- JON (Finnish journal numbers (diaarinumero))
- FIBC (Finnish business identity codes (y-tunnus))
- NORP (nationality, religious and political groups)

Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that
recognition accuracy for these entities also tends to be lower.

The training data is relatively recent, so the model may have difficulties when the input
contains, for example, old names or writing styles.

## How to use

The easiest way to use the model is with the Transformers pipeline for token classification:

```python
from transformers import pipeline

# Load the fine-tuned NER model from the Hugging Face Hub
model_checkpoint = "Kansallisarkisto/finbert-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
```
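
With `aggregation_strategy="simple"`, the pipeline returns one dictionary per recognized entity span, with fields such as `entity_group`, `score`, `word`, `start` and `end`. Continuing from the snippet above, a minimal post-processing sketch could look like the following; the variable names are illustrative, and it assumes the predicted `entity_group` values match the category names listed earlier:

```python
# Illustrative post-processing of the pipeline output (not part of the original card)
results = token_classifier(
    "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."
)

for entity in results:
    # Each aggregated prediction carries the entity type, confidence score and surface form
    print(f"{entity['entity_group']:8s} {entity['score']:.3f} {entity['word']}")

# For example, keep only the geopolitical locations (assuming a GPE label name)
places = [e["word"] for e in results if e["entity_group"] == "GPE"]
```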

## Training data

Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
dataset were filtered out from the dataset used for training the model; a sketch of this kind of filtering step is shown after the table below. In addition to this dataset, OCR'd and annotated content of
digitized documents from Finnish public administration was also used for model training. The number of entities belonging to the different
entity classes contained in the training, validation and test datasets is listed below:

Number of entity types in the data

Dataset|O|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
-|-|-|-|-|-|-|-|-|-|-|-
Train|0|0|0|0|0|0|0|0|0|0|0
Val|0|0|0|0|0|0|0|0|0|0|0
Test|0|0|0|0|0|0|0|0|0|0|0
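
As noted above, entity classes such as WORK_OF_ART, LAW and MONEY were filtered out of the OntoNotes annotations before training. A minimal sketch of that kind of filtering step over CoNLL-style data is shown below; the file name, column format and helper function are hypothetical illustrations, not taken from the actual preprocessing code.

```python
# Hypothetical sketch: drop filtered-out entity classes from CoNLL-style token/tag pairs;
# the file name, column format and kept label set are assumptions, not the actual code.
KEPT = {"PERSON", "ORG", "LOC", "GPE", "PRODUCT", "EVENT", "DATE", "JON", "FIBC", "NORP"}

def filter_labels(lines):
    """Replace tags of filtered-out classes (e.g. B-MONEY) with 'O'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # empty line = sentence boundary
            yield ""
            continue
        token, tag = line.split("\t")
        if tag != "O" and tag.split("-", 1)[1] not in KEPT:
            tag = "O"
        yield f"{token}\t{tag}"

# Hypothetical usage with a tab-separated token/tag file
with open("turku_one_train.tsv", encoding="utf-8") as f:
    filtered = list(filter_labels(f))
```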

## Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:

The training code with instructions is available [here](https://github.com/DALAI-hanke/BERT_NER).
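
The hyperparameter values themselves are not listed in this card; the linked repository contains the actual training code. Purely as an illustration of what fine-tuning a BERT token classifier with the Transformers `Trainer` API looks like, the following sketch uses placeholder values throughout, not the configuration used for this model:

```python
# Illustrative sketch only: every value below is a placeholder,
# not a hyperparameter actually used to train Kansallisarkisto/finbert-ner.
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments

base_checkpoint = "TurkuNLP/bert-base-finnish-cased-v1"
num_labels = 21  # e.g. O plus B-/I- tags for the 10 entity classes

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(base_checkpoint, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="finbert-ner",        # hypothetical output directory
    learning_rate=2e-5,              # placeholder
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,              # placeholder
    weight_decay=0.01,               # placeholder
)
# These arguments would then be passed to a Trainer together with the
# tokenized and label-aligned training and validation datasets.
```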