|
--- |
|
license: cc-by-nc-nd-4.0 |
|
metrics: |
|
- accuracy |
|
tags: |
|
- generated_from_trainer |
|
- text-generation

- primary-sequence-prediction
|
model-index: |
|
- name: protgpt2-finetuned-sarscov2-rbd |
|
results: [] |
|
--- |
|
|
|
# Model Card for `protgpt2-finetuned-sarscov2-rbd` |
|
|
|
This model is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) on sequences from the NCBI Virus Data Portal. |
|
|
|
It achieves the following results on the evaluation set: |
|
- Loss: 1.1674 |
|
- Accuracy: 0.8883 |
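
Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language modeling with the Hugging Face `Trainer`), it corresponds to a token-level perplexity of about 3.21:

```python
import math

# Evaluation loss reported above, interpreted as mean cross-entropy per token (nats).
eval_loss = 1.1674

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # → 3.21
```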
|
|
|
## Model description |
|
|
|
This model is a fine-tuned checkpoint of |
|
[ProtGPT2](https://huggingface.co/nferruz/ProtGPT2), which was originally |
|
trained on the UniRef50 (version 2021_04) database. For a detailed overview |
|
of the original model configuration and architecture, please see the linked |
|
model card, or refer to the ProtGPT2 publication. |
|
|
|
The model was fine-tuned on data from the SARS-CoV-2 spike (surface glycoprotein) receptor-binding domain (RBD).
|
|
|
A repository with the training scripts, train and test data partitions, as well as evaluation code is available on GitHub at [rahuldhodapkar/PredictSARSVariants](https://github.com/rahuldhodapkar/PredictSARSVariants).
|
|
|
## Intended uses & limitations |
|
|
|
This model is intended to generate synthetic SARS-CoV-2 surface glycoprotein (a.k.a. spike protein) sequences for the purpose of identifying meaningful variants for characterization, either experimentally or through other *in silico* tools. These variants may be used to drive vaccine development to protect against previously unobserved point mutants that are likely to arise in the future.
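
As an illustrative sketch of how prompts and outputs might be pre- and post-processed (assuming the ProtGPT2 convention of FASTA-like sequences wrapped at 60 residues per line, with `<|endoftext|>` as the sequence separator; the helper names below are hypothetical, not part of this repository):

```python
def to_protgpt2_format(seq: str, width: int = 60) -> str:
    """Wrap a plain amino-acid string into ProtGPT2's FASTA-like input format."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "<|endoftext|>\n" + "\n".join(lines) + "\n"

def from_protgpt2_format(text: str) -> str:
    """Strip the separator token and line breaks from generated text."""
    return text.replace("<|endoftext|>", "").replace("\n", "").strip()

# Illustrative seed fragment from the spike RBD region.
rbd_fragment = "NITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNV"
prompt = to_protgpt2_format(rbd_fragment)
assert from_protgpt2_format(prompt) == rbd_fragment
```

With `transformers`, a prompt formatted this way could then be passed to a `text-generation` pipeline loaded from this checkpoint.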
|
|
|
As this model is based on the original ProtGPT2 model, it is subject to many of the same limitations as the base model. Any biases present in the UniRef50 dataset, such as nonuniform sampling of peptides across taxonomic clades, will also be present in the model. These limitations should be considered when interpreting its output.
|
|
|
## Training and evaluation data |
|
|
|
SARS-CoV-2 spike protein sequences were obtained from the NIH SARS-CoV-2 Data Hub
|
accessible at |
|
|
|
https://www.ncbi.nlm.nih.gov/labs/virus/vssi/ |
|
|
|
Note that the reference sequence for the surface glycoprotein can be found at: |
|
|
|
https://www.ncbi.nlm.nih.gov/protein/1791269090 |
|
|
|
As the base ProtGPT2 model was pretrained on the UniRef50 (version 2021_04) dataset, it cannot have seen sequence data generated after that release. Evaluations are therefore conducted using SARS-CoV-2 sequences collected on or after May 2021.
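
A minimal sketch of such a date-based split (assuming each record carries a collection date, as in the Data Hub metadata; the field names here are illustrative, not taken from this repository):

```python
from datetime import date

# Records collected on or after this date are held out for evaluation,
# since UniRef50 2021_04 cannot contain them.
EVAL_CUTOFF = date(2021, 5, 1)

def split_by_date(records):
    """Partition records into pre-cutoff (train) and post-cutoff (eval) sets."""
    train, evaluation = [], []
    for rec in records:
        (evaluation if rec["collection_date"] >= EVAL_CUTOFF else train).append(rec)
    return train, evaluation

records = [
    {"id": "A", "collection_date": date(2020, 12, 1)},
    {"id": "B", "collection_date": date(2021, 6, 15)},
]
train, evaluation = split_by_date(records)
print([r["id"] for r in train], [r["id"] for r in evaluation])  # → ['A'] ['B']
```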
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 1e-05 |
|
- train_batch_size: 16 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 3.0 |
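
For context, a `linear` scheduler decays the learning rate from its peak value to zero over the course of training. A minimal sketch of that decay (assuming no warmup steps, which are not listed above; the total step count is illustrative, as the real value depends on dataset size and batch size):

```python
def linear_lr(step: int, total_steps: int, peak_lr: float = 1e-5) -> float:
    """Linearly decay from peak_lr at step 0 to 0 at total_steps."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

total = 1000  # illustrative step count
print(linear_lr(0, total))     # → 1e-05
print(linear_lr(500, total))   # → 5e-06
print(linear_lr(total, total)) # → 0.0
```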
|
|
|
### Framework versions |
|
|
|
- Transformers 4.26.0.dev0 |
|
- Pytorch 1.11.0 |
|
- Datasets 2.8.0 |
|
- Tokenizers 0.13.2 |
|
|
|
|