vaitekunas/biobert-fachpraktikum

	---
	license: mit
	pipeline_tag: token-classification
	tags:
	- BERT
	- bioBERT
	- NER
	- medical
	metrics:
	- f1
	language:
	- en
	---

	# Model

	NER model for recognizing disease, treatment, and medical technology entities. The model and the data are intended for educational use.

	The original dataset tags have been augmented with "inside" tags to handle the sub-tokens produced by the WordPiece tokenizer. The following NER tags are used:
	* `B-DISEASE`, `I-DISEASE`: begin and inside tags for disease
	* `B-TREATMENT`, `I-TREATMENT`: begin and inside tags for treatment
	* `B-TECHNOLOGY`, `I-TECHNOLOGY`: begin and inside tags for technology
	* `O`: outside any entity (irrelevant)
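The tag augmentation can be sketched as follows. This is a minimal illustration, not the preprocessing code used for this model: the helper name `expand_word_label` is hypothetical, and it assumes that the first sub-token of a word receives the `B-` tag and every following sub-token the matching `I-` tag, which is the pattern visible in the predictions shown in this card.

```python
def expand_word_label(pieces, entity):
    """Return one tag per WordPiece sub-token of a single word.

    `entity` is the word-level dataset tag, e.g. "DISEASE", or "O".
    Assumption: first piece -> B-<entity>, remaining pieces -> I-<entity>.
    """
    if entity == "O":
        return ["O"] * len(pieces)
    return [f"B-{entity}"] + [f"I-{entity}"] * (len(pieces) - 1)


# Sub-token split taken from the prediction example in this card.
pieces = ["men", "##ing", "##itis"]
print(expand_word_label(pieces, "DISEASE"))
# ['B-DISEASE', 'I-DISEASE', 'I-DISEASE']
```

A word that the tokenizer keeps whole, such as `bacterial`, yields a single `B-` tag under this scheme.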

	```
	# Text:
	Acute obstructive hydrocephalus complicating bacterial meningitis in childhood

	# Real:
	Acute -> DISEASE
	obstructive -> DISEASE
	hydrocephalus -> DISEASE
	bacterial -> DISEASE
	meningitis -> DISEASE

	# Predictions:
	o##bs##truct##ive -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
	h##ydro##ce##pha##lus -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
	bacterial -> B-DISEASE
	men##ing##itis -> B-DISEASE + I-DISEASE + I-DISEASE
	```
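To compare sub-token predictions with the word-level labels, the pieces have to be joined back into words. The sketch below is one possible way to do that and is not taken from the model's own code: the function name `merge_subtokens` is hypothetical, and it assumes that continuation pieces carry the `##` prefix and that a word inherits the entity of its first piece.

```python
def merge_subtokens(tokens, tags):
    """Group WordPiece tokens into words, keeping the first piece's entity."""
    words = []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            # Continuation piece: append its text to the previous word.
            text, entity = words[-1]
            words[-1] = (text + token[2:], entity)
        else:
            # New word: drop the B-/I- prefix to recover the entity name.
            entity = tag if tag == "O" else tag.split("-", 1)[1]
            words.append((token, entity))
    return words


# Tokens and tags copied from the prediction example above.
tokens = ["bacterial", "men", "##ing", "##itis"]
tags = ["B-DISEASE", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
print(merge_subtokens(tokens, tags))
# [('bacterial', 'DISEASE'), ('meningitis', 'DISEASE')]
```

Taking the label of the first piece is a design choice; a majority vote over all pieces of a word would be an alternative.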

	# Sources

	This pipeline is based on the pretrained [dmis-lab/biobert-base-cased-v1.2](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) model,
	fine-tuned on the relatively small [BeHealthy Medical Entity](https://www.kaggle.com/datasets/arunagirirajan/medical-entity-recognition-ner)
	dataset (1,550 training samples). The initial version of this model was then used
	to augment the medical technology [dataset](https://github.com/VictoriaDimanova/Robust-medical-NER/tree/main/Textcorpus). Both datasets were then used to train
	this model.

	# Performance

	The model has not been extensively tuned. The quality of the training data is unclear because the origin of the data and the annotation process are unknown.

	| Metric    | Score    |
	|-----------|----------|
	| Precision | 0.836892 |
	| Recall    | 0.766610 |
	| F1        | 0.800211 |
	| Accuracy  | 0.935253 |
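
The reported F1 score is consistent with the harmonic mean of the precision and recall in the table. The short check below only reproduces that arithmetic; it does not show how the scores were originally computed (for example, whether they are token-level or entity-level), which this card does not state. The helper name `f1_score` is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the table above.
precision, recall = 0.836892, 0.766610
print(round(f1_score(precision, recall), 6))
# 0.800211
```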