---
license: cc-by-nc-4.0
tags:
  - prokbert
  - bioinformatics
  - genomics
  - sequence embedding
  - genomic language models
  - nucleotide
  - dna-sequence
  - promoter-prediction
  - phage
---

ProkBERT-mini-phage Model

This fine-tuned model is designed for phage sequence identification and is based on the ProkBERT-mini model.

For more details, refer to the description of the phage dataset used to train and evaluate this model.

Example Usage

For practical examples of how to use this model, see the following Jupyter notebooks:

  • Training Notebook: A guide to fine-tuning the ProkBERT-mini model for phage identification tasks.
  • Evaluation Notebook: Demonstrates how to evaluate the fine-tuned ProkBERT-mini-phage model on test datasets.

Model Application

The model was trained for binary classification to distinguish between phage and non-phage (bacterial) sequences. The non-phage sequences were randomly sampled from the phages' host genomes.

Simple Usage Example

The following example demonstrates how to use the ProkBERT-mini-phage model to classify a DNA sequence:

from prokbert.prokbert_tokenizer import ProkBERTTokenizer
from transformers import MegatronBertForSequenceClassification

finetuned_model = "neuralbioinfo/prokbert-mini-phage"

# Tokenization parameters: 6-mers with a shift of 1 (k6s1), matching the pretrained tokenizer
kmer = 6
shift = 1
tok_params = {'kmer': kmer,
              'shift': shift}

tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
model = MegatronBertForSequenceClassification.from_pretrained(finetuned_model)

sequence = 'CACCGCATGGAGATCGGCACCTACTTCGACAAGCTGGAGGCGCTGCTGAAGGAGTGGTACGAGGCGCGCGGGGGTGAGGCATGACGGACTGGCAAGAGGAGCAGCGTCAGCGC'

inputs = tokenizer(sequence, return_tensors="pt")
# Ensure that inputs have a batch dimension
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
# Generate outputs from the model
outputs = model(**inputs)
print(outputs)
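
The model returns raw classification logits. A minimal sketch of how a prediction could be read out is shown below; it assumes that label index 1 corresponds to the phage class and index 0 to the non-phage class, which should be verified against the model's id2label configuration:

import torch

# Convert logits to class probabilities (assumption: index 1 = phage, index 0 = non-phage)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = int(probs.argmax(dim=-1).item())
print(f"P(non-phage) = {probs[0, 0].item():.3f}, P(phage) = {probs[0, 1].item():.3f}")
print(f"Predicted class: {predicted_class}")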

Model Details

Developed by: Neural Bioinformatics Research Group

Architecture:

  • Tokenizer: The model uses a 6-mer tokenizer with a shift of 1 (k6s1), designed to handle DNA sequences efficiently.
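
To illustrate what a 6-mer, shift-1 (k6s1) tokenization means, the sketch below slides a 6-base window along the sequence one base at a time. This is a conceptual illustration only, not the actual ProkBERTTokenizer implementation:

def kmer_tokenize(sequence: str, kmer: int = 6, shift: int = 1):
    """Split a DNA sequence into overlapping k-mers (conceptual sketch, not ProkBERTTokenizer)."""
    return [sequence[i:i + kmer] for i in range(0, len(sequence) - kmer + 1, shift)]

print(kmer_tokenize("ATGCATGCAT"))
# ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA', 'ATGCAT']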

Parameters:

| Parameter         | Value                      |
|-------------------|----------------------------|
| Model Size        | 20.6 million parameters    |
| Max. Context Size | 1024 bp                    |
| Training Data     | 206.65 billion nucleotides |
| Layers            | 6                          |
| Attention Heads   | 6                          |
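
As a quick sanity check (a sketch assuming the model has been loaded as in the usage example above), the parameter count can be verified directly from the loaded Hugging Face model:

# Count the parameters of the loaded model (expected to be roughly 20.6 million)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f} M parameters")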

Intended Use

Intended Use Cases: ProkBERT-mini-phage is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:

  • Sequence classification tasks
  • Exploration of genomic patterns and features

Installation of ProkBERT (if needed)

To set up ProkBERT in your environment, install it with pip (pip install prokbert). In a Jupyter or Colab notebook, the following snippet installs the package only if it is not already available:

# Note: the "!pip" syntax below works in Jupyter/IPython environments such as Colab.
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")

Evaluation on the Phage Recognition Benchmark Dataset

| method | L | auc_class1 | acc | f1 | mcc | recall | sensitivity | specificity | tn | fp | fn | tp | Np | Nn | eval_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepVirFinder | 256 | 0.734914 | 0.627163 | 0.481213 | 0.309049 | 0.345317 | 0.345317 | 0.909856 | 4542 | 450 | 3278 | 1729 | 5007 | 4992 | 7580 |
| DeepVirFinder | 512 | 0.791423 | 0.708 | 0.637717 | 0.443065 | 0.521192 | 0.521192 | 0.889722 | 4510 | 559 | 2361 | 2570 | 4931 | 5069 | 2637 |
| DeepVirFinder | 1024 | 0.826255 | 0.7424 | 0.702678 | 0.505333 | 0.605651 | 0.605651 | 0.880579 | 4380 | 594 | 1982 | 3044 | 5026 | 4974 | 1294 |
| DeepVirFinder | 2048 | 0.853098 | 0.7717 | 0.743339 | 0.557177 | 0.6612 | 0.6612 | 0.8822 | 4411 | 589 | 1694 | 3306 | 5000 | 5000 | 1351 |
| INHERIT | 256 | 0.75982 | 0.6943 | 0.67012 | 0.393179 | 0.620008 | 0.620008 | 0.76883 | 3838 | 1154 | 1903 | 3105 | 5008 | 4992 | 2131 |
| INHERIT | 512 | 0.816326 | 0.7228 | 0.651408 | 0.479323 | 0.525248 | 0.525248 | 0.914973 | 4638 | 431 | 2341 | 2590 | 4931 | 5069 | 2920 |
| INHERIT | 1024 | 0.846547 | 0.7264 | 0.659447 | 0.495935 | 0.527059 | 0.527059 | 0.927825 | 4615 | 359 | 2377 | 2649 | 5026 | 4974 | 3055 |
| INHERIT | 2048 | 0.864122 | 0.7365 | 0.668595 | 0.518541 | 0.5316 | 0.5316 | 0.9414 | 4707 | 293 | 2342 | 2658 | 5000 | 5000 | 3225 |
| MINI | 256 | 0.846745 | 0.7755 | 0.766462 | 0.552855 | 0.735623 | 0.735623 | 0.815505 | 4071 | 921 | 1324 | 3684 | 5008 | 4992 | 6.68888 |
| MINI | 512 | 0.924973 | 0.8657 | 0.859121 | 0.732696 | 0.83046 | 0.83046 | 0.89998 | 4562 | 507 | 836 | 4095 | 4931 | 5069 | 16.3681 |
| MINI | 1024 | 0.956432 | 0.9138 | 0.911189 | 0.829645 | 0.879825 | 0.879825 | 0.94813 | 4716 | 258 | 604 | 4422 | 5026 | 4974 | 51.3319 |
| MINI-C | 256 | 0.827635 | 0.7512 | 0.7207 | 0.51538 | 0.640974 | 0.640974 | 0.861779 | 4302 | 690 | 1798 | 3210 | 5008 | 4992 | 7.33697 |
| MINI-C | 512 | 0.913378 | 0.8466 | 0.834876 | 0.69725 | 0.786453 | 0.786453 | 0.905109 | 4588 | 481 | 1053 | 3878 | 4931 | 5069 | 17.6749 |
| MINI-C | 1024 | 0.94644 | 0.8937 | 0.891564 | 0.788427 | 0.869479 | 0.869479 | 0.918175 | 4567 | 407 | 656 | 4370 | 5026 | 4974 | 54.204 |
| MINI-LONG | 256 | 0.777697 | 0.71495 | 0.686224 | 0.437727 | 0.622404 | 0.622404 | 0.807792 | 8065 | 1919 | 3782 | 6234 | 10016 | 9984 | 6.10304 |
| MINI-LONG | 512 | 0.880831 | 0.81405 | 0.798001 | 0.632855 | 0.744879 | 0.744879 | 0.881338 | 8935 | 1203 | 2516 | 7346 | 9862 | 10138 | 12.1307 |
| MINI-LONG | 1024 | 0.9413 | 0.88925 | 0.884917 | 0.781465 | 0.847195 | 0.847195 | 0.931745 | 9269 | 679 | 1536 | 8516 | 10052 | 9948 | 30.5088 |
| MINI-LONG | 2048 | 0.964551 | 0.929 | 0.927455 | 0.85878 | 0.9077 | 0.9077 | 0.9503 | 9503 | 497 | 923 | 9077 | 10000 | 10000 | 94.404 |
| Virsorter2 | 512 | 0.620782 | 0.6259 | 0.394954 | 0.364831 | 0.247617 | 0.247617 | 0.993884 | 5038 | 31 | 3710 | 1221 | 4931 | 5069 | 2057 |
| Virsorter2 | 1024 | 0.719898 | 0.7178 | 0.621919 | 0.51036 | 0.461799 | 0.461799 | 0.976478 | 4857 | 117 | 2705 | 2321 | 5026 | 4974 | 3258 |
| Virsorter2 | 2048 | 0.816142 | 0.8103 | 0.778724 | 0.647532 | 0.6676 | 0.6676 | 0.953 | 4765 | 235 | 1662 | 3338 | 5000 | 5000 | 5737 |

Column Descriptions

  • method: The algorithm or method used for prediction (e.g., DeepVirFinder, INHERIT).
  • L: Length of the genomic segment in base pairs (bp).
  • auc_class1: Area under the ROC curve for class 1, indicating the model's ability to distinguish between classes.
  • acc: Accuracy of the prediction, representing the proportion of true results (both true positives and true negatives) among the total number of cases examined.
  • f1: The F1 score, a measure of a test's accuracy that considers both the precision and the recall.
  • mcc: Matthews correlation coefficient, a quality measure for binary (two-class) classifications.
  • recall: The recall, or true positive rate, measures the proportion of actual positives that are correctly identified.
  • sensitivity: Sensitivity or true positive rate; identical to recall.
  • specificity: The specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified.
  • tn: The number of true negatives, indicating how many negative class samples were correctly identified.
  • fp: The number of false positives, indicating how many negative class samples were incorrectly identified as positive.
  • fn: The number of false negatives, indicating how many positive class samples were incorrectly identified as negative.
  • tp: The number of true positives, indicating how many positive class samples were correctly identified.
  • Np: The number of positive (phage) samples in the evaluation set (tp + fn).
  • Nn: The number of negative (non-phage) samples in the evaluation set (tn + fp).
  • eval_time: The time taken to evaluate the model or method, usually in seconds.
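
To illustrate how these metrics relate to the confusion-matrix counts, the sketch below recomputes them from tn, fp, fn and tp using their standard definitions (this is not the original evaluation code); the example counts are taken from the DeepVirFinder, L=256 row of the table above:

# Recompute headline metrics from confusion-matrix counts
# (example counts from the DeepVirFinder, L=256 row of the table above)
tn, fp, fn, tp = 4542, 450, 3278, 1729

acc = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)                      # identical to sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5

print(f"acc={acc:.6f}, recall={recall:.6f}, specificity={specificity:.6f}, f1={f1:.6f}, mcc={mcc:.6f}")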

Ethical Considerations and Limitations

Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.

Reporting Issues

Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group.

Reference

If you use ProkBERT in your research, please cite the following paper:

@ARTICLE{10.3389/fmicb.2023.1331233,
    AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
    TITLE={ProkBERT family: genomic language models for microbiome applications},
    JOURNAL={Frontiers in Microbiology},
    VOLUME={14},
    YEAR={2024},
    URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
    DOI={10.3389/fmicb.2023.1331233},
    ISSN={1664-302X},
    ABSTRACT={...}
}