Add third NER-tag: medical technology

Files changed (4)

README.md +22 -19
config.json +12 -8
model.safetensors +2 -2
tokenizer.json +1 -6

README.md CHANGED

@@ -14,11 +14,12 @@ language:
 # Model
-NER-Model for disease/treatment entity recognition. The purpose of the model/data use is educational.
 The original dataset tags have been augmented with "inside"-Tags in order to handle sub-tokens produced by the WordPiece tokenizer. Following NER-tags are used:
-* `B-D`, `I-D`: begin and inside tags for disease
-* `B-T`, `I-T`: begin and inside tags for treatment
 * `O` - outside entities (irrelevant)
 ```
@@ -26,32 +27,34 @@ The original dataset tags have been augmented with "inside"-Tags in order to han
 Acute obstructive hydrocephalus complicating bacterial meningitis in childhood
 # Real:
-Acute           -> D
-obstructive     -> D
-hydrocephalus   -> D
-bacterial       -> D
-meningitis      -> D
 # Predictions:
-o##bs##truct##ive     -> B-D + I-D + I-D + I-D
-h##ydro##ce##pha##lus -> B-D + I-D + I-D + I-D + I-D
-bacterial             -> B-D
-men##ing##itis        -> B-D + I-D + I-D
 ```
 # Sources
 This pipeline is based on the [dmis-lab/biobert-base-cased-v1.2](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) pretrained model,
 fine-tuned using the relatively small [BeHealthy Medical Entity](https://www.kaggle.com/datasets/arunagirirajan/medical-entity-recognition-ner)
-dataset (1.550 training samples).
 # Performance
 The model has not been extensively tuned. The quality of the dataset is not clear, due to unknown origin of the data / annotation process.
-|Metric   |Score     |
-|---------|----------|
-Precision | 0.854523 |
-Recall    | 0.859779 |
-F1        | 0.857143 |
-Accuracy  | 0.919590 |

 # Model
+NER-Model for disease/treatment/technology entity recognition. The purpose of the model/data use is educational.
 The original dataset tags have been augmented with "inside"-Tags in order to handle sub-tokens produced by the WordPiece tokenizer. Following NER-tags are used:
+* `B-DISEASE`, `I-DISEASE`: begin and inside tags for disease
+* `B-TREATMENT`, `I-TREATMENT`: begin and inside tags for treatment
+* `B-TECHNOLOGY`, `I-TECHNOLOGY`: begin and inside tags for technology
 * `O` - outside entities (irrelevant)
 ```
 Acute obstructive hydrocephalus complicating bacterial meningitis in childhood
 # Real:
+Acute           -> DISEASE
+obstructive     -> DISEASE
+hydrocephalus   -> DISEASE
+bacterial       -> DISEASE
+meningitis      -> DISEASE
 # Predictions:
+o##bs##truct##ive     -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
+h##ydro##ce##pha##lus -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
+bacterial             -> B-DISEASE
+men##ing##itis        -> B-DISEASE + I-DISEASE + I-DISEASE
 ```
 # Sources
 This pipeline is based on the [dmis-lab/biobert-base-cased-v1.2](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) pretrained model,
 fine-tuned using the relatively small [BeHealthy Medical Entity](https://www.kaggle.com/datasets/arunagirirajan/medical-entity-recognition-ner)
+dataset (1.550 training samples). The initial version of this model was then used
+to augment the medical technology [dataset](https://github.com/VictoriaDimanova/Robust-medical-NER/tree/main/Textcorpus). Both datasets were then used to train
+this model.
 # Performance
 The model has not been extensively tuned. The quality of the dataset is not clear, due to unknown origin of the data / annotation process.
+| Metric    | Score    |
+|-----------|----------|
+| Precision | 0.836892 |
+| Recall    | 0.766610 |
+| F1        | 0.800211 |
+| Accuracy  | 0.935253 |
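
As a quick sanity check on the two performance tables, the reported F1 values can be recomputed from the reported precision and recall. This is a minimal sketch: `f1_score` is a hypothetical helper written for this note, not part of the model repository, and the README does not say how the metrics were computed (for example token-level versus entity-level), so only the arithmetic relation is checked here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the README tables in this diff.
previous_f1 = f1_score(0.854523, 0.859779)  # two entity types (disease, treatment)
current_f1 = f1_score(0.836892, 0.766610)   # three entity types (adds technology)

print(f"previous F1: {previous_f1:.6f}")
print(f"current F1:  {current_f1:.6f}")
```

Both results agree with the tabulated F1 scores to the precision shown, so the drop in F1 follows mainly from the lower recall after the third tag was added.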

config.json CHANGED

@@ -10,18 +10,22 @@
   "hidden_size": 768,
   "id2label": {
     "0": "O",
-    "1": "B-D",
-    "2": "I-D",
-    "3": "B-T",
-    "4": "I-T"
   },
   "initializer_range": 0.02,
   "intermediate_size": 3072,
   "label2id": {
-    "B-D": 1,
-    "B-T": 3,
-    "I-D": 2,
-    "I-T": 4,
     "O": 0
   },
   "layer_norm_eps": 1e-12,

   "hidden_size": 768,
   "id2label": {
     "0": "O",
+    "1": "B-DISEASE",
+    "2": "I-DISEASE",
+    "3": "B-TREATMENT",
+    "4": "I-TREATMENT",
+    "5": "B-TECHNOLOGY",
+    "6": "I-TECHNOLOGY"
   },
   "initializer_range": 0.02,
   "intermediate_size": 3072,
   "label2id": {
+    "B-DISEASE": 1,
+    "B-TECHNOLOGY": 5,
+    "B-TREATMENT": 3,
+    "I-DISEASE": 2,
+    "I-TECHNOLOGY": 6,
+    "I-TREATMENT": 4,
     "O": 0
   },
   "layer_norm_eps": 1e-12,
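
The new label mapping follows a regular pattern: `O` gets id 0, then each entity type receives a `B-` and an `I-` id in order. The sketch below rebuilds that mapping and applies the sub-token convention described in the README (first WordPiece piece gets the `B-` tag, the remaining pieces get `I-` tags). The function names and the hard-coded piece list are assumptions for illustration; no real tokenizer is called, and `config.json` stores the ids as string keys rather than integers.

```python
ENTITY_TYPES = ["DISEASE", "TREATMENT", "TECHNOLOGY"]


def build_label_maps(entity_types):
    """Build id2label/label2id with 'O' at 0 and a B-/I- pair per entity type."""
    id2label = {0: "O"}
    for entity in entity_types:
        for prefix in ("B", "I"):
            id2label[len(id2label)] = f"{prefix}-{entity}"
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def expand_word_tag(pieces, entity):
    """Spread one word-level tag over the word's sub-token pieces."""
    if not pieces:
        return []
    if entity == "O":
        return ["O"] * len(pieces)
    return [f"B-{entity}"] + [f"I-{entity}"] * (len(pieces) - 1)


id2label, label2id = build_label_maps(ENTITY_TYPES)

# Pieces taken from the README example "o##bs##truct##ive".
pieces = ["o", "##bs", "##truct", "##ive"]
tags = expand_word_tag(pieces, "DISEASE")
tag_ids = [label2id[tag] for tag in tags]
print(tags, tag_ids)
```

Generating both dictionaries from one list of entity types keeps `id2label` and `label2id` exact inverses of each other, which is easy to get wrong when a new tag is added by hand.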

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:459b13eec5f15870478901aa449227f27f1f7a703682d33405eb24cb4cb587f4
-size 430917436

 version https://git-lfs.github.com/spec/v1
+oid sha256:92ded4be8adf990a1d90dba8b6961b9130a24f6ed8e9a12ac7aaf00e49968575
+size 430923588
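
The change in file size is consistent with a classification head that grew from 5 to 7 labels. The sketch below checks this under stated assumptions: a single linear token-classification layer on top of the 768-dimensional hidden state (weights plus one bias per label), parameters stored as 4-byte float32, and no other change to the file. None of these are confirmed by the diff itself; `classifier_bytes` is a hypothetical helper.

```python
HIDDEN_SIZE = 768      # "hidden_size" from config.json
BYTES_PER_PARAM = 4    # assumption: float32 weights


def classifier_bytes(num_labels, hidden_size=HIDDEN_SIZE, bytes_per_param=BYTES_PER_PARAM):
    """Bytes used by a linear head: one weight row plus one bias per label."""
    return num_labels * (hidden_size + 1) * bytes_per_param


# 5 labels before (O, B-D, I-D, B-T, I-T), 7 labels after this commit.
expected_growth = classifier_bytes(7) - classifier_bytes(5)

# Sizes copied from the Git LFS pointers in this diff.
observed_growth = 430923588 - 430917436

print(expected_growth, observed_growth)
```

Under these assumptions the two numbers match, which suggests that the encoder shape is unchanged and only the output layer was resized.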

tokenizer.json CHANGED

@@ -1,11 +1,6 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 512,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
   "padding": null,
   "added_tokens": [
     {

 {
   "version": "1.0",
+  "truncation": null,
   "padding": null,
   "added_tokens": [
     {