---
license: mit
---

# Model Card for named_entity_recognition.pt
This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the Global Biodata Coalition in collaboration with the Chan Zuckerberg Initiative.
## Model Details

### Model Description
This model has been fine-tuned to detect resource names in scientific articles (titles and abstracts). It performs token classification, assigning each token a predicted label following the BIO scheme. The predicted labels are post-processed to determine the "common name" (often an acronym) and "full name" of a resource mentioned in an article; a sketch of this post-processing appears after the details list below.
- Developed by: Ana-Maria Istrate and Kenneth E. Schackart III
- Shared by: Kenneth E. Schackart III
- Model type: RoBERTa (BERT; Transformer)
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
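As context for the description above, here is a minimal sketch of the BIO post-processing step: grouping B-/I- tagged tokens into entity spans. The label names (`B-full`, `I-full`, `B-common`, `I-common`) and the example tokens are illustrative assumptions, not the repository's actual label set.

```python
# Minimal sketch of BIO post-processing: group B-/I- tagged tokens into
# entity spans. Label names are illustrative assumptions; see the GitHub
# repository for the label set actually used.
def bio_to_spans(tokens, labels):
    """Collect (entity_type, text) spans from BIO-labeled tokens."""
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["The", "Protein", "Data", "Bank", "(", "PDB", ")", "stores", "structures"]
labels = ["O", "B-full", "I-full", "I-full", "O", "B-common", "O", "O", "O"]
print(bio_to_spans(tokens, labels))
# [('full', 'Protein Data Bank'), ('common', 'PDB')]
```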
### Model Sources
- Repository: https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- Paper: TBA
- Demo: TBA
## Uses

This model can be used to find predicted biodata resource names in an article's title and abstract.
### Direct Use

The model has not been designed or assessed for direct use.
### Out-of-Scope Use

The model should not be used for anything other than the use described in the Uses section above.
## Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora, as described in Gururangan S., et al., 2020. Second, the model was fine-tuned on scientific articles that were manually annotated by two curators, so biases in the manual annotation may have affected fine-tuning. Additionally, the manually annotated data were procured using a specific search query to Europe PMC, so generalizability may be limited when the model is applied to articles from other sources.
### Recommendations

The model should only be used to identify resource names in articles retrieved from Europe PMC using the query present in the GitHub repository. Additionally, only articles predicted or known to describe a biodata resource should be used as input.
## How to Get Started with the Model

Follow the directions in the GitHub repository. The sketch below illustrates the general pattern for loading the checkpoint and running inference.
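This is an illustrative sketch only, not the supported entry point: the checkpoint path, state-dict key, and `num_labels` value are all assumptions.

```python
# Illustrative sketch of loading the checkpoint and running inference.
# The checkpoint path, state-dict key, and num_labels are assumptions;
# the repository's scripts are the supported way to run the model.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

base = "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=5)

# Load the fine-tuned weights (hypothetical checkpoint structure).
state = torch.load("named_entity_recognition.pt", map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

# Title and abstract concatenated with a single space (see Preprocessing).
text = "MicrobeDB: a database of microbial genomes. MicrobeDB provides access to ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label ID per token; map to BIO labels and post-process.
predicted_ids = logits.argmax(dim=-1).squeeze(0).tolist()
```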
## Training Details

### Training Data

The model was trained on the training split of the labeled data.
Note: the data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.
### Training Procedure
The model was trained for 10 epochs, and F1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest F1-score on the validation set was saved (regardless of epoch number).
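A sketch of that checkpoint-selection logic is below; the model, data loaders, and training/evaluation functions are placeholders standing in for the repository's actual routines.

```python
# Sketch of best-checkpoint selection over 10 epochs. All training
# components here are placeholders for the repository's actual routines.
import copy

import torch
import torch.nn as nn

model = nn.Linear(4, 5)  # placeholder model
train_loader, val_loader, optimizer = None, None, None  # placeholders

def train_one_epoch(model, loader, optimizer):
    """Placeholder for one epoch of fine-tuning."""

def evaluate(model, loader):
    """Placeholder; returns validation metrics computed after the epoch."""
    return {"f1": 0.5, "precision": 0.5, "recall": 0.5, "loss": 1.0}

best_f1, best_state = -1.0, None
for epoch in range(10):
    train_one_epoch(model, train_loader, optimizer)
    metrics = evaluate(model, val_loader)
    if metrics["f1"] > best_f1:  # keep the best validation F1, regardless of epoch
        best_f1 = metrics["f1"]
        best_state = copy.deepcopy(model.state_dict())

torch.save(best_state, "named_entity_recognition.pt")
```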
#### Preprocessing
To generate the input to the model, each article's title and abstract were concatenated into a contiguous string, separated by a single whitespace character. All XML tags were removed using a regular expression.
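A minimal sketch of this preprocessing, assuming a generic tag-stripping regular expression (the exact pattern used in the repository may differ):

```python
import re

def preprocess(title: str, abstract: str) -> str:
    """Concatenate title and abstract with a single space and strip XML tags."""
    text = f"{title} {abstract}"
    return re.sub(r"<[^>]+>", "", text)  # generic tag-stripping pattern (assumption)

print(preprocess("MicrobeDB: a database.", "<p>MicrobeDB provides <i>genomes</i>.</p>"))
# MicrobeDB: a database. MicrobeDB provides genomes.
```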
#### Speeds, Sizes, Times
The model checkpoint is 496 MB. Speed has not been benchmarked.
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
The model was evaluated using the test split of the labeled data.
#### Metrics
The model was evaluated using F1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.
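For illustration, entity-level precision, recall, and F1-score can be computed from BIO label sequences as in the sketch below; using seqeval here is an assumption, and the repository may compute these metrics differently.

```python
# Hypothetical illustration of entity-level metrics using seqeval;
# the repository may compute them differently.
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["O", "B-full", "I-full", "O", "B-common"]]
y_pred = [["O", "B-full", "I-full", "O", "O"]]

print(precision_score(y_true, y_pred))  # 1.0 : the one predicted entity is correct
print(recall_score(y_true, y_pred))     # 0.5 : one of two true entities was found
print(f1_score(y_true, y_pred))         # ~0.67: harmonic mean of the two
```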
### Results
- F1-score: 0.717
- Precision: 0.689
- Recall: 0.748
## Model Examination

The model performs satisfactorily at identifying resource names in articles that describe biodata resources.
## Model Architecture and Objective

The base model architecture is as described in Gururangan S., et al., 2020. Token classification is performed using a linear token-classification layer initialized via transformers.AutoModelForTokenClassification().
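A minimal sketch of that initialization, where num_labels=5 (O plus B-/I- tags for two entity types) is an assumption:

```python
# Sketch of the head initialization described above: RoBERTa base model
# plus a linear token-classification layer. num_labels=5 is an assumption.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500",
    num_labels=5,
)
print(model.classifier)  # Linear(in_features=768, out_features=5, bias=True)
```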
## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.
### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.
### Software
Training software was written in Python.
## Citation
BibTeX:

TBA

APA:

TBA
## Model Card Authors
This model card was written by Kenneth E. Schackart III.
## Model Card Contact
Ken Schackart: [email protected]