---
license: mit
---
|
|
|
# Model Card for named_entity_recognition.pt |
|
|
|
This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the |
|
[Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
This model has been fine-tuned to detect resource names in scientific articles (title and abstract). This is done using token classification, which assigns predicted
token labels following the [BIO scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). These labels are post-processed to determine the
predicted "common names" (often an acronym) and "full names" of a resource present in an article.
|
|
|
|
|
|
|
- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III |
|
- **Shared by:** Kenneth E. Schackart III |
|
- **Model type:** RoBERTa (BERT-style Transformer)
|
- **Language(s) (NLP):** English
|
- **License:** MIT |
|
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500 |
|
|
|
## Model Sources |
|
|
|
- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev |
|
- **Paper:** TBA

- **Demo:** TBA
|
|
|
# Uses |
|
|
|
This model can be used to find predicted biodata resource names in an article's title and abstract.
|
|
|
## Direct Use |
|
|
|
The model was not designed for direct use, and direct use has not been assessed.
|
|
|
## Out-of-Scope Use |
|
|
|
The model should not be used for anything other than the use described in [Uses](named_entity_recognition_modelcard.md#uses).
|
|
|
# Bias, Risks, and Limitations |
|
|
|
Biases may have been introduced at several stages of the development and training of this model. First, the base model was trained on biomedical corpora
as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were
manually annotated by two curators, and biases in the manual annotation may have affected model fine-tuning. Additionally, the manually annotated data were
procured using a specific search query to Europe PMC, so generalizability may be limited when applying the model to articles from other sources.
|
|
|
## Recommendations |
|
|
|
The model should only be used for identifying resource names in articles from Europe PMC using the |
|
[query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository. |
|
Additionally, only articles predicted or known to describe a biodata resource should be used as input.
|
|
|
## How to Get Started with the Model |
|
|
|
Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).
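
As a rough complement to those directions, the snippet below sketches how the checkpoint might be loaded for inference with the `transformers` library. It assumes the `.pt` file holds a plain PyTorch state dict and that five BIO labels were used; prefer the repository's own scripts for the supported workflow.

```python
# Sketch only: loading the fine-tuned checkpoint with Hugging Face transformers.
# Assumptions: the .pt file is a plain state dict, and the model was trained
# with five BIO labels. Prefer the scripts in the GitHub repository.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

BASE = "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForTokenClassification.from_pretrained(BASE, num_labels=5)
state_dict = torch.load("named_entity_recognition.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

text = "The Protein Data Bank (PDB) archives macromolecular structures."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
label_ids = logits.argmax(dim=-1)[0].tolist()
print(label_ids)  # map IDs back to BIO labels using the repository's label scheme
```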
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
The model was trained on the training split from the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv). |
|
|
|
*Note*: The data can be split into consistent training, validation, testing splits using the procedures detailed in the GitHub repository. |
|
|
|
## Training Procedure |
|
|
|
The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest *F*1-score on the validation |
|
set was saved (regardless of epoch number). |
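
The loop below is a schematic of that selection strategy, with stub training and evaluation functions standing in for the repository's real code; only the best-validation-*F*1 bookkeeping is meant literally.

```python
# Schematic of the checkpoint-selection strategy; train_one_epoch() and
# evaluate() are stubs, not the repository's functions.
import random
import torch
from torch import nn

model = nn.Linear(4, 2)            # stand-in for the fine-tuned RoBERTa model

def train_one_epoch(model):        # stub: real code updates model weights
    pass

def evaluate(model):               # stub: real code scores the validation split
    return random.random()         # pretend validation F1-score

best_f1 = -1.0
for epoch in range(10):
    train_one_epoch(model)
    val_f1 = evaluate(model)
    if val_f1 > best_f1:           # keep the checkpoint with the best validation
        best_f1 = val_f1           # F1-score, regardless of epoch number
        torch.save(model.state_dict(), "best_checkpoint.pt")
```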
|
|
|
### Preprocessing |
|
|
|
To generate the input to the model, the article title and abstract were concatenated into a contiguous string, separated by a single whitespace character. All
XML tags were removed using a regular expression.
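
A minimal sketch of this preprocessing is shown below; the exact regular expression used in the repository may differ.

```python
# Minimal preprocessing sketch: join title and abstract with one space and
# strip XML tags. The repository's actual regular expression may differ.
import re

def preprocess(title: str, abstract: str) -> str:
    text = f"{title} {abstract}"           # concatenate with one whitespace character
    return re.sub(r"<[^>]+>", "", text)    # remove XML tags

print(preprocess("A <i>new</i> database.", "We present <b>ExampleDB</b>, a web resource."))
# A new database. We present ExampleDB, a web resource.
```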
|
|
|
### Speeds, Sizes, Times |
|
|
|
The model checkpoint is 496 MB. Speed has not been benchmarked. |
|
|
|
# Evaluation |
|
|
|
|
|
|
## Testing Data, Factors & Metrics |
|
|
|
### Testing Data |
|
|
|
|
|
|
The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv). |
|
|
|
### Metrics |
|
|
|
|
|
|
The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection. |
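
For illustration only, token-level versions of these metrics can be computed as below with scikit-learn; the repository's evaluation code may score entities differently, and the label names are assumed.

```python
# Illustrative token-level precision/recall/F1 with scikit-learn; the
# repository's evaluation may differ, and the label names are assumed.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["B-COM", "O", "O", "B-FUL", "I-FUL", "O"]
y_pred = ["B-COM", "O", "O", "B-FUL", "O",     "O"]

entity_labels = ["B-COM", "I-COM", "B-FUL", "I-FUL"]   # exclude "O" from scoring
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=entity_labels, average="micro", zero_division=0
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```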
|
|
|
## Results |
|
|
|
- *F*1-score: 0.717 |
|
- Precision: 0.689 |
|
- Recall: 0.748 |
|
|
|
### Summary |
|
|
|
|
|
|
|
# Model Examination |
|
|
|
The model works satisfactorily for identifying resource names from articles describing biodata resources in the literature. |
|
|
|
## Model Architecture and Objective |
|
|
|
The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Token classification is performed using
a linear token classification layer initialized via [transformers.AutoModelForTokenClassification](https://huggingface.co/docs/transformers/model_doc/auto).
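
The snippet below sketches how such a head might be attached to the base checkpoint before fine-tuning; the number of labels (five, one per assumed BIO tag) is an assumption.

```python
# Sketch of attaching a token classification head to the base checkpoint.
# num_labels=5 (O, B-COM, I-COM, B-FUL, I-FUL) is assumed for illustration.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500",
    num_labels=5,
)
# The classification layer is randomly initialized here; its weights are
# learned during fine-tuning on the manually labeled NER data.
```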
|
|
|
## Compute Infrastructure |
|
|
|
The model was fine-tuned on Google Colaboratory.
|
|
|
### Hardware |
|
|
|
The model was fine-tuned using GPU acceleration provided by Google Colaboratory.
|
|
|
### Software |
|
|
|
Training software was written in Python. |
|
|
|
# Citation |
|
|
|
|
|
|
TBA |
|
|
|
**BibTeX:** |
|
|
|
TBA |
|
|
|
**APA:** |
|
|
|
TBA |
|
|
|
# Model Card Authors |
|
|
|
This model card was written by Kenneth E. Schackart III. |
|
|
|
# Model Card Contact |
|
|
|
Ken Schackart: <[email protected]> |
|
|
|
|
|
|