vaitekunas committed on
Commit ae73b2e
1 Parent(s): 8496d9e

Add third NER-tag: medical technology

Files changed (4)
  1. README.md +22 -19
  2. config.json +12 -8
  3. model.safetensors +2 -2
  4. tokenizer.json +1 -6
README.md CHANGED
@@ -14,11 +14,12 @@ language:
 
 # Model
 
-NER-Model for disease/treatment entity recognition. The purpose of the model/data use is educational.
+NER-Model for disease/treatment/technology entity recognition. The purpose of the model/data use is educational.
 
 The original dataset tags have been augmented with "inside"-Tags in order to handle sub-tokens produced by the WordPiece tokenizer. Following NER-tags are used:
-* `B-D`, `I-D`: begin and inside tags for disease
-* `B-T`, `I-T`: begin and inside tags for treatment
+* `B-DISEASE`, `I-DISEASE`: begin and inside tags for disease
+* `B-TREATMENT`, `I-TREATMENT`: begin and inside tags for treatment
+* `B-TECHNOLOGY`, `I-TECHNOLOGY`: begin and inside tags for technology
 * `O` - outside entities (irrelevant)
 
 ```
@@ -26,32 +27,34 @@ The original dataset tags have been augmented with "inside"-Tags in order to han
 Acute obstructive hydrocephalus complicating bacterial meningitis in childhood
 
 # Real:
-Acute -> D
-obstructive -> D
-hydrocephalus -> D
-bacterial -> D
-meningitis -> D
+Acute -> DISEASE
+obstructive -> DISEASE
+hydrocephalus -> DISEASE
+bacterial -> DISEASE
+meningitis -> DISEASE
 
 # Predictions:
-o##bs##truct##ive -> B-D + I-D + I-D + I-D
-h##ydro##ce##pha##lus -> B-D + I-D + I-D + I-D + I-D
-bacterial -> B-D
-men##ing##itis -> B-D + I-D + I-D
+o##bs##truct##ive -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
+h##ydro##ce##pha##lus -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
+bacterial -> B-DISEASE
+men##ing##itis -> B-DISEASE + I-DISEASE + I-DISEASE
 ```
 
 # Sources
 
 This pipeline is based on the [dmis-lab/biobert-base-cased-v1.2](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) pretrained model,
 fine-tuned using the relatively small [BeHealthy Medical Entity](https://www.kaggle.com/datasets/arunagirirajan/medical-entity-recognition-ner)
-dataset (1.550 training samples).
+dataset (1.550 training samples). The initial version of this model was then used
+to augment the medical technology [dataset](https://github.com/VictoriaDimanova/Robust-medical-NER/tree/main/Textcorpus). Both datasets were then used to train
+this model.
 
 # Performance
 
 The model has not been extensively tuned. The quality of the dataset is not clear, due to unknown origin of the data / annotation process.
 
-|Metric |Score |
-|---------|----------|
-Precision | 0.854523 |
-Recall | 0.859779 |
-F1 | 0.857143 |
-Accuracy | 0.919590 |
+| Metric | Score |
+|-----------|----------|
+| Precision | 0.836892 |
+| Recall | 0.766610 |
+| F1 | 0.800211 |
+| Accuracy | 0.935253 |
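The README tags every WordPiece sub-token with a begin or inside label, so word-level entities have to be reassembled from the sub-token predictions. The following is a minimal sketch of one way to do that, not the author's post-processing: the helper name `merge_subtokens` and the rule "a word takes the entity type of its first piece" are assumptions made for illustration, and the token split is copied from the README example rather than produced by the actual tokenizer.

```python
def merge_subtokens(tokens, tags):
    """Collapse WordPiece pieces ("##"-prefixed continuations) into words.

    Hypothetical helper: each word takes the entity type of its first
    piece's tag (e.g. "B-DISEASE" -> "DISEASE"); "O" stays "O".
    """
    words = []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            # Continuation piece: append its text to the current word.
            text, label = words[-1]
            words[-1] = (text + token[2:], label)
        else:
            # New word: strip the B-/I- prefix to get the entity type.
            label = tag.split("-", 1)[1] if "-" in tag else tag
            words.append((token, label))
    return words


# Token split taken from the README example ("men##ing##itis").
tokens = ["men", "##ing", "##itis", "in", "childhood"]
tags = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O"]
print(merge_subtokens(tokens, tags))
# [('meningitis', 'DISEASE'), ('in', 'O'), ('childhood', 'O')]
```

Taking the label of the first piece is only one possible policy; a majority vote over the pieces would be an equally plausible choice.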
config.json CHANGED
@@ -10,18 +10,22 @@
   "hidden_size": 768,
   "id2label": {
     "0": "O",
-    "1": "B-D",
-    "2": "I-D",
-    "3": "B-T",
-    "4": "I-T"
+    "1": "B-DISEASE",
+    "2": "I-DISEASE",
+    "3": "B-TREATMENT",
+    "4": "I-TREATMENT",
+    "5": "B-TECHNOLOGY",
+    "6": "I-TECHNOLOGY"
   },
   "initializer_range": 0.02,
   "intermediate_size": 3072,
   "label2id": {
-    "B-D": 1,
-    "B-T": 3,
-    "I-D": 2,
-    "I-T": 4,
+    "B-DISEASE": 1,
+    "B-TECHNOLOGY": 5,
+    "B-TREATMENT": 3,
+    "I-DISEASE": 2,
+    "I-TECHNOLOGY": 6,
+    "I-TREATMENT": 4,
     "O": 0
   },
   "layer_norm_eps": 1e-12,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:459b13eec5f15870478901aa449227f27f1f7a703682d33405eb24cb4cb587f4
-size 430917436
+oid sha256:92ded4be8adf990a1d90dba8b6961b9130a24f6ed8e9a12ac7aaf00e49968575
+size 430923588
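The size change in the LFS pointer can be cross-checked against the label change in config.json. The arithmetic below is a sketch under stated assumptions, none of which the commit confirms: weights are stored as 32-bit floats, the classification head is a single linear layer from `hidden_size` to the number of labels with a bias term, and nothing else in the file changed length.

```python
HIDDEN_SIZE = 768        # "hidden_size" from config.json
BYTES_PER_FLOAT32 = 4    # assumption: weights stored as float32

old_labels, new_labels = 5, 7  # O + 2 entity types vs. O + 3 entity types
added_labels = new_labels - old_labels

# Assumed head layout: one weight row of HIDDEN_SIZE values plus one bias per label.
expected_growth = added_labels * (HIDDEN_SIZE + 1) * BYTES_PER_FLOAT32

# Sizes from the old and new Git LFS pointer files.
observed_growth = 430923588 - 430917436

print(expected_growth, observed_growth)  # 6152 6152
```

Under these assumptions the two numbers agree, which is consistent with the only shape change being the two extra output labels.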
tokenizer.json CHANGED
@@ -1,11 +1,6 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 512,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
+  "truncation": null,
   "padding": null,
   "added_tokens": [
     {