
	---
	license: cc-by-nc-nd-4.0
	metrics:
	- accuracy
	tags:
	- generated_from_trainer
	- Text Generation
	- Primary Sequence Prediction
	model-index:
	- name: protgpt2-finetuned-sarscov2-rbd
	  results: []
	---

	# Model Card for `protgpt2-finetuned-sarscov2-rbd`

	This model is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) on sequences from the NCBI Virus Data Portal.

	It achieves the following results on the evaluation set:
	- Loss: 1.1674
	- Accuracy: 0.8883
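
The reported loss can be converted to a perplexity, which is often easier to compare across language models. The sketch below assumes the loss is the mean per-token cross-entropy in nats; this is the usual convention for causal language-model training, but the card does not state it explicitly.

```python
import math

# Evaluation loss reported in this model card.
eval_loss = 1.1674

# Perplexity is exp(mean cross-entropy), assuming the loss is in nats per token.
perplexity = math.exp(eval_loss)

print(f"perplexity ~ {perplexity:.2f}")  # prints: perplexity ~ 3.21
```

Note that ProtGPT2 tokens are subword units rather than single residues, so this figure is per token, not per amino acid.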

	## Model description

	This model is a fine-tuned checkpoint of
	[ProtGPT2](https://huggingface.co/nferruz/ProtGPT2), which was originally
	trained on the UniRef50 (version 2021_04) database. For a detailed overview
	of the original model configuration and architecture, please see the linked
	model card, or refer to the ProtGPT2 publication.

	The model was fine-tuned on sequences of the SARS-CoV-2 spike (surface
	glycoprotein) receptor-binding domain (RBD).

	A repository with the training scripts, the train and test data partitions, and
	the evaluation code is available on GitHub at
	[rahuldhodapkar/PredictSARSVariants](https://github.com/rahuldhodapkar/PredictSARSVariants).

	## Intended uses & limitations

	This model is intended to generate synthetic SARS-CoV-2 surface glycoprotein
	(a.k.a. spike protein) sequences in order to identify meaningful variants for
	characterization, either experimentally or with other in silico tools. These
	variants may be used to guide vaccine development against plausible future
	point mutants that have not yet been observed.
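
A minimal sketch of how sequences might be sampled from this checkpoint with the `transformers` text-generation pipeline is shown below. The prompt token and the sampling settings (`top_k`, `repetition_penalty`, `eos_token_id`) are assumptions borrowed from the base ProtGPT2 model card and have not been confirmed for this fine-tuned model; `clean_sequence` and `sample_sequences` are hypothetical helper names.

```python
import re
from typing import List

MODEL_ID = "rahuldhodapkar/protgpt2-finetuned-sarscov2-rbd"


def clean_sequence(generated_text: str) -> str:
    """Strip the end-of-text token and all whitespace from generated text."""
    text = generated_text.replace("<|endoftext|>", "")
    return re.sub(r"\s+", "", text)


def sample_sequences(n: int = 5, max_length: int = 100) -> List[str]:
    """Sample n sequences from the checkpoint (downloads the model on first use)."""
    # Imported lazily so clean_sequence can be used without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        "<|endoftext|>",         # assumed start prompt, as used with base ProtGPT2
        max_length=max_length,   # length in tokens, not residues
        do_sample=True,
        top_k=950,               # assumption: setting suggested for base ProtGPT2
        repetition_penalty=1.2,  # assumption: setting suggested for base ProtGPT2
        num_return_sequences=n,
        eos_token_id=0,          # assumption: id of <|endoftext|> in ProtGPT2
    )
    return [clean_sequence(o["generated_text"]) for o in outputs]
```

Calling `sample_sequences(3)` would return three cleaned strings. Good sampling settings for this checkpoint may differ from those of the base model, so they are worth tuning.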

	As this model is based on the original ProtGPT2 model, it is subject to many
	of the same limitations as the base model. Any biases present in the UniRef50
	dataset will also be present in the model, which may include nonuniform skew
	of peptides sampled across different taxonomic clades. These limitations
	should be considered when interpreting the output of this model.

	## Training and evaluation data

	SARS-CoV-2 spike protein sequences were obtained from the NCBI Virus SARS-CoV-2
	Data Hub, accessible at:

	https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

	Note that the reference sequence for the surface glycoprotein can be found at:

	https://www.ncbi.nlm.nih.gov/protein/1791269090
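
One way to relate a generated sequence back to this reference is to list the positions where the two differ. The helper below is a hypothetical sketch, not part of the linked repository. It assumes the two sequences are already aligned and of equal length, and it takes an `offset` so that numbering can follow the full spike protein when only the RBD is compared.

```python
from typing import List


def point_substitutions(reference: str, variant: str, offset: int = 0) -> List[str]:
    """List substitutions such as 'E4K' between two equal-length sequences.

    offset is added to the 1-based position, so numbering can follow the full
    protein when only a sub-region is compared.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length; align them first")
    return [
        f"{ref}{i + offset}{var}"
        for i, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


# Toy sequences for illustration only; not real RBD fragments.
print(point_substitutions("ACDEFG", "ACDKFG"))       # ['E4K']
print(point_substitutions("ACDEFG", "ACDKFG", 100))  # ['E104K']
```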

	Because the base ProtGPT2 model was pretrained on the UniRef50
	(version 2021_04) dataset, its pretraining data cannot include sequences
	generated after that release. Evaluations will therefore be conducted using
	SARS-CoV-2 sequences generated on or after May 2021.
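
The date-based hold-out described above can be sketched as a simple filter. The record layout, the field names, and the choice of collection date as the relevant date are assumptions made for illustration, as is reading "May 2021" as 1 May 2021.

```python
from datetime import date
from typing import Dict, List

# Assumed cutoff: "on or after May 2021" read as 1 May 2021.
CUTOFF = date(2021, 5, 1)


def held_out(records: List[Dict[str, str]], cutoff: date = CUTOFF) -> List[Dict[str, str]]:
    """Keep only records whose collection date is on or after the cutoff."""
    return [
        r for r in records
        if date.fromisoformat(r["collection_date"]) >= cutoff
    ]


# Hypothetical records; ids and sequences are placeholders.
records = [
    {"id": "seq_a", "collection_date": "2021-03-14", "sequence": "ACDE"},
    {"id": "seq_b", "collection_date": "2021-05-01", "sequence": "ACDK"},
    {"id": "seq_c", "collection_date": "2022-01-20", "sequence": "ACNK"},
]
print([r["id"] for r in held_out(records)])  # ['seq_b', 'seq_c']
```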

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 3.0
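
For readers who want to reproduce a similar run, the values above map onto `transformers.TrainingArguments` roughly as follows. This is a sketch rather than the exact configuration used: it assumes training on a single device (so the listed batch sizes equal the per-device sizes), and `build_training_args` and its `output_dir` argument are hypothetical.

```python
# Keyword arguments mirroring the hyperparameters listed in this card.
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumes a single device
    "per_device_eval_batch_size": 16,   # assumes a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def build_training_args(output_dir: str):
    """Create TrainingArguments; output_dir is a path chosen by the caller."""
    # Imported lazily so the dictionary above can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```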

	### Framework versions

	- Transformers 4.26.0.dev0
	- PyTorch 1.11.0
	- Datasets 2.8.0
	- Tokenizers 0.13.2