---
license: cc-by-nc-4.0
tags:
  - prokbert
  - bioinformatics
  - genomics
  - sequence embedding
  - genomic language models
  - nucleotide
  - dna-sequence
  - promoter-prediction
  - phage
---

ProkBERT-mini-phage Model

This fine-tuned model is designed for phage sequence identification and is based on the ProkBERT-mini model.

For more details, refer to the description of the phage dataset used to train and evaluate this model.

Example Usage

For practical examples of how to use this model, see the following Jupyter notebooks:

  • Training Notebook: A guide to fine-tuning the ProkBERT-mini model for phage identification tasks.
  • Evaluation Notebook: Demonstrates how to evaluate the fine-tuned ProkBERT-mini-phage model on test datasets.

Model Application

The model was trained for binary classification to distinguish between phage and non-phage (bacterial) sequences. The non-phage sequences were randomly sampled from the phages' host genomes.

Simple Usage Example

The following example demonstrates how to use the ProkBERT-mini-phage model to classify a DNA sequence:

from prokbert.prokbert_tokenizer import ProkBERTTokenizer
from transformers import MegatronBertForSequenceClassification

finetuned_model = "neuralbioinfo/prokbert-mini-phage"

# Tokenization parameters: 6-mers with a shift of 1 (k6s1), matching the pretrained tokenizer
kmer = 6
shift = 1
tok_params = {'kmer': kmer,
              'shift': shift}

tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
model = MegatronBertForSequenceClassification.from_pretrained(finetuned_model)

sequence = 'CACCGCATGGAGATCGGCACCTACTTCGACAAGCTGGAGGCGCTGCTGAAGGAGTGGTACGAGGCGCGCGGGGGTGAGGCATGACGGACTGGCAAGAGGAGCAGCGTCAGCGC'

inputs = tokenizer(sequence, return_tensors="pt")
# Ensure that inputs have a batch dimension
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
# Generate outputs from the model
outputs = model(**inputs)
print(outputs)
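
The model returns raw classification logits. A minimal sketch of how a prediction could be read out is shown below; it assumes that label index 1 corresponds to the phage class and index 0 to the non-phage class, which should be verified against the model's id2label configuration:

import torch

# Convert logits to class probabilities (assumption: index 1 = phage, index 0 = non-phage)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = int(probs.argmax(dim=-1).item())
print(f"P(non-phage) = {probs[0, 0].item():.3f}, P(phage) = {probs[0, 1].item():.3f}")
print(f"Predicted class: {predicted_class}")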

Model Details

Developed by: Neural Bioinformatics Research Group

Architecture:

  • Tokenizer: The model uses a 6-mer tokenizer with a shift of 1 (k6s1), designed to handle DNA sequences efficiently.
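
To illustrate what a 6-mer, shift-1 (k6s1) tokenization means, the sketch below slides a 6-base window along the sequence one base at a time. This is a conceptual illustration only, not the actual ProkBERTTokenizer implementation:

def kmer_tokenize(sequence: str, kmer: int = 6, shift: int = 1):
    """Split a DNA sequence into overlapping k-mers (conceptual sketch, not ProkBERTTokenizer)."""
    return [sequence[i:i + kmer] for i in range(0, len(sequence) - kmer + 1, shift)]

print(kmer_tokenize("ATGCATGCAT"))
# ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA', 'ATGCAT']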

Parameters:

| Parameter         | Value                      |
|-------------------|----------------------------|
| Model Size        | 20.6 million parameters    |
| Max. Context Size | 1024 bp                    |
| Training Data     | 206.65 billion nucleotides |
| Layers            | 6                          |
| Attention Heads   | 6                          |
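
As a quick sanity check (a sketch assuming the model has been loaded as in the usage example above), the parameter count can be verified directly from the loaded Hugging Face model:

# Count the parameters of the loaded model (expected to be roughly 20.6 million)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f} M parameters")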

Intended Use

Intended Use Cases: ProkBERT-mini-phage is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:

  • Sequence classification tasks
  • Exploration of genomic patterns and features

Installation of ProkBERT (if needed)

To set up ProkBERT in your environment, install it with pip (pip install prokbert). In a Jupyter or Colab notebook, the following snippet installs the package only if it is not already available:

# Note: the "!pip" syntax below works in Jupyter/IPython environments such as Colab.
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")

Evaluation on the Phage Recognition Benchmark Dataset

| method | L | auc_class1 | acc | f1 | mcc | recall | sensitivity | specificity | tn | fp | fn | tp | Np | Nn | eval_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepVirFinder | 256 | 0.734914 | 0.627163 | 0.481213 | 0.309049 | 0.345317 | 0.345317 | 0.909856 | 4542 | 450 | 3278 | 1729 | 5007 | 4992 | 7580 |
| DeepVirFinder | 512 | 0.791423 | 0.708 | 0.637717 | 0.443065 | 0.521192 | 0.521192 | 0.889722 | 4510 | 559 | 2361 | 2570 | 4931 | 5069 | 2637 |
| DeepVirFinder | 1024 | 0.826255 | 0.7424 | 0.702678 | 0.505333 | 0.605651 | 0.605651 | 0.880579 | 4380 | 594 | 1982 | 3044 | 5026 | 4974 | 1294 |
| DeepVirFinder | 2048 | 0.853098 | 0.7717 | 0.743339 | 0.557177 | 0.6612 | 0.6612 | 0.8822 | 4411 | 589 | 1694 | 3306 | 5000 | 5000 | 1351 |
| INHERIT | 256 | 0.75982 | 0.6943 | 0.67012 | 0.393179 | 0.620008 | 0.620008 | 0.76883 | 3838 | 1154 | 1903 | 3105 | 5008 | 4992 | 2131 |
| INHERIT | 512 | 0.816326 | 0.7228 | 0.651408 | 0.479323 | 0.525248 | 0.525248 | 0.914973 | 4638 | 431 | 2341 | 2590 | 4931 | 5069 | 2920 |
| INHERIT | 1024 | 0.846547 | 0.7264 | 0.659447 | 0.495935 | 0.527059 | 0.527059 | 0.927825 | 4615 | 359 | 2377 | 2649 | 5026 | 4974 | 3055 |
| INHERIT | 2048 | 0.864122 | 0.7365 | 0.668595 | 0.518541 | 0.5316 | 0.5316 | 0.9414 | 4707 | 293 | 2342 | 2658 | 5000 | 5000 | 3225 |
| MINI | 256 | 0.846745 | 0.7755 | 0.766462 | 0.552855 | 0.735623 | 0.735623 | 0.815505 | 4071 | 921 | 1324 | 3684 | 5008 | 4992 | 6.68888 |
| MINI | 512 | 0.924973 | 0.8657 | 0.859121 | 0.732696 | 0.83046 | 0.83046 | 0.89998 | 4562 | 507 | 836 | 4095 | 4931 | 5069 | 16.3681 |
| MINI | 1024 | 0.956432 | 0.9138 | 0.911189 | 0.829645 | 0.879825 | 0.879825 | 0.94813 | 4716 | 258 | 604 | 4422 | 5026 | 4974 | 51.3319 |
| MINI-C | 256 | 0.827635 | 0.7512 | 0.7207 | 0.51538 | 0.640974 | 0.640974 | 0.861779 | 4302 | 690 | 1798 | 3210 | 5008 | 4992 | 7.33697 |
| MINI-C | 512 | 0.913378 | 0.8466 | 0.834876 | 0.69725 | 0.786453 | 0.786453 | 0.905109 | 4588 | 481 | 1053 | 3878 | 4931 | 5069 | 17.6749 |
| MINI-C | 1024 | 0.94644 | 0.8937 | 0.891564 | 0.788427 | 0.869479 | 0.869479 | 0.918175 | 4567 | 407 | 656 | 4370 | 5026 | 4974 | 54.204 |
| MINI-LONG | 256 | 0.777697 | 0.71495 | 0.686224 | 0.437727 | 0.622404 | 0.622404 | 0.807792 | 8065 | 1919 | 3782 | 6234 | 10016 | 9984 | 6.10304 |
| MINI-LONG | 512 | 0.880831 | 0.81405 | 0.798001 | 0.632855 | 0.744879 | 0.744879 | 0.881338 | 8935 | 1203 | 2516 | 7346 | 9862 | 10138 | 12.1307 |
| MINI-LONG | 1024 | 0.9413 | 0.88925 | 0.884917 | 0.781465 | 0.847195 | 0.847195 | 0.931745 | 9269 | 679 | 1536 | 8516 | 10052 | 9948 | 30.5088 |
| MINI-LONG | 2048 | 0.964551 | 0.929 | 0.927455 | 0.85878 | 0.9077 | 0.9077 | 0.9503 | 9503 | 497 | 923 | 9077 | 10000 | 10000 | 94.404 |
| Virsorter2 | 512 | 0.620782 | 0.6259 | 0.394954 | 0.364831 | 0.247617 | 0.247617 | 0.993884 | 5038 | 31 | 3710 | 1221 | 4931 | 5069 | 2057 |
| Virsorter2 | 1024 | 0.719898 | 0.7178 | 0.621919 | 0.51036 | 0.461799 | 0.461799 | 0.976478 | 4857 | 117 | 2705 | 2321 | 5026 | 4974 | 3258 |
| Virsorter2 | 2048 | 0.816142 | 0.8103 | 0.778724 | 0.647532 | 0.6676 | 0.6676 | 0.953 | 4765 | 235 | 1662 | 3338 | 5000 | 5000 | 5737 |

Column Descriptions

  • method: The algorithm or method used for prediction (e.g., DeepVirFinder, INHERIT).
  • L: Length of the genomic segment in base pairs (bp).
  • auc_class1: Area under the ROC curve for class 1, indicating the model's ability to distinguish between classes.
  • acc: Accuracy of the prediction, representing the proportion of true results (both true positives and true negatives) among the total number of cases examined.
  • f1: The F1 score, a measure of a test's accuracy that considers both the precision and the recall.
  • mcc: Matthews correlation coefficient, a quality measure for binary (two-class) classifications.
  • recall: The recall, or true positive rate, measures the proportion of actual positives that are correctly identified.
  • sensitivity: Sensitivity or true positive rate; identical to recall.
  • specificity: The specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified.
  • tn: The number of true negatives, indicating how many negative class samples were correctly identified.
  • fp: The number of false positives, indicating how many negative class samples were incorrectly identified as positive.
  • fn: The number of false negatives, indicating how many positive class samples were incorrectly identified as negative.
  • tp: The number of true positives, indicating how many positive class samples were correctly identified.
  • Np: The number of positive (phage) samples in the evaluation set (tp + fn).
  • Nn: The number of negative (non-phage) samples in the evaluation set (tn + fp).
  • eval_time: The time taken to evaluate the model or method, usually in seconds.
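
To illustrate how these metrics relate to the confusion-matrix counts, the sketch below recomputes them from tn, fp, fn and tp using their standard definitions (this is not the original evaluation code); the example counts are taken from the DeepVirFinder, L=256 row of the table above:

# Recompute headline metrics from confusion-matrix counts
# (example counts from the DeepVirFinder, L=256 row of the table above)
tn, fp, fn, tp = 4542, 450, 3278, 1729

acc = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)                      # identical to sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5

print(f"acc={acc:.6f}, recall={recall:.6f}, specificity={specificity:.6f}, f1={f1:.6f}, mcc={mcc:.6f}")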

Ethical Considerations and Limitations

Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.

Reporting Issues

Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group.

Reference

If you use ProkBERT in your research, please cite the following paper:

@ARTICLE{10.3389/fmicb.2023.1331233,
    AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
    TITLE={ProkBERT family: genomic language models for microbiome applications},
    JOURNAL={Frontiers in Microbiology},
    VOLUME={14},
    YEAR={2024},
    URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
    DOI={10.3389/fmicb.2023.1331233},
    ISSN={1664-302X},
    ABSTRACT={...}
}