MikkoLipsanen committed • Commit 373cfeb • 1 Parent(s): 5afd4b8
Update README.md
README.md CHANGED
@@ -56,8 +56,11 @@ token_classifier("'Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vu
 ## Training data
 
 Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
-dataset were filtered out from the dataset used for training the model.
-
+dataset were filtered out from the dataset used for training the model.
+
+In addition to this dataset, OCR'd and annotated content of
+digitized documents from Finnish public administration was also used for model training.
+The number of entities belonging to the different
 entity classes contained in training, validation and test datasets are listed below:
 
 ### Number of entity types in the data
@@ -67,6 +70,10 @@ Train|11691|30026|868|12999|7473|1184|14918|01360|1879|2068
 Val|1542|4042|108|1654|879|160|1858|177|257|299
 Test|1267|3698|86|1713|901|137|1843|174|233|260
 
+The annotation of the data was performed in cooperation between the National Archives of Finland
+and the [FIN-CLARIAH](https://www.kielipankki.fi/organization/fin-clariah/) research infrastructure
+for Social Sciences and Humanities.
+
 ## Training procedure
 
 This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:
@@ -79,4 +86,9 @@ This model was trained using a NVIDIA RTX A6000 GPU with the following hyperpara
 - maximum length of data sequence: 512
 - patience: 2 epochs
 
-
+In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
+in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
+using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
+model.
+
+The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER).
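The entity-type filtering described in the updated "Training data" paragraph (dropping annotations such as WORK_OF_ART, LAW and MONEY from the Turku OntoNotes Entities Corpus) could be done roughly as below. This is a minimal sketch, not the repository's actual preprocessing code: it assumes CoNLL-style `token<TAB>tag` files with BIO tags, and the file names are placeholders.

```python
# Hypothetical sketch: remap filtered-out entity types to "O" in a CoNLL-style file.
# Assumes one "token<TAB>tag" pair per line and blank lines between sentences (BIO tags).
FILTERED_TYPES = {"WORK_OF_ART", "LAW", "MONEY"}  # examples named in the model card

def filter_entity_types(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if not line:                       # sentence boundary
                fout.write("\n")
                continue
            token, tag = line.split("\t")
            # e.g. "B-LAW" / "I-LAW" become "O" when the type is filtered out
            if "-" in tag and tag.split("-", 1)[1] in FILTERED_TYPES:
                tag = "O"
            fout.write(f"{token}\t{tag}\n")

filter_entity_types("turku_one_train.tsv", "train_filtered.tsv")  # placeholder file names
```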
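The "Training procedure" hunks show only the tail of the hyperparameter list (maximum sequence length 512, patience of 2 epochs). As a rough illustration only, and assuming the patience setting corresponds to early stopping on a validation metric, those two settings could be wired into the Hugging Face `transformers` Trainer as sketched below; the output directory, dataset objects and best-model metric are placeholders, not taken from the model card.

```python
# Hypothetical sketch, not the repository's training code: expressing "patience: 2 epochs"
# (as early stopping) alongside data tokenized to a 512-token maximum with the HF Trainer.
from transformers import (AutoModelForTokenClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

def build_trainer(train_dataset, eval_dataset, num_labels: int) -> Trainer:
    model = AutoModelForTokenClassification.from_pretrained(
        "TurkuNLP/bert-base-finnish-cased-v1", num_labels=num_labels
    )
    args = TrainingArguments(
        output_dir="finbert-ner",            # placeholder path
        evaluation_strategy="epoch",         # evaluate once per epoch ("eval_strategy" in newer releases)
        save_strategy="epoch",
        load_best_model_at_end=True,         # required by EarlyStoppingCallback
        metric_for_best_model="eval_loss",   # assumption; the card does not state the criterion
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,         # datasets already tokenized with max_length=512
        eval_dataset=eval_dataset,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # "patience: 2 epochs"
    )
```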
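The preprocessing paragraph added in the last hunk (splitting input texts into chunks of at most 300 tokens so that the tokenized length stays within the 512-token limit, using the bert-base-finnish-cased-v1 tokenizer) might look roughly like the sketch below. It assumes whitespace-separated words as the chunking unit, which the card does not specify; only the tokenizer checkpoint comes from the card.

```python
# Sketch of the chunking step described above: split raw text into chunks of at most
# 300 whitespace-separated tokens, then verify that no chunk exceeds 512 subword tokens
# after WordPiece tokenization. The word-level chunking unit is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def split_into_chunks(text: str, max_words: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

text = "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."  # example sentence from the card
for chunk in split_into_chunks(text):
    input_ids = tokenizer(chunk)["input_ids"]   # no truncation, so the length check is meaningful
    assert len(input_ids) <= 512
```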