---
license: cc-by-nc-4.0
---

## ProkBERT-mini-phage Model

This fine-tuned model is specifically designed for promoter identification and is based on the [ProkBERT-mini model](https://huggingface.co/neuralbioinfo/prokbert-mini).

For more details, refer to the [phage dataset description](https://huggingface.co/datasets/neuralbioinfo/phage-test-10k) used for training and evaluating this model.

### Example Usage

For practical examples of how to use this model, see the following Jupyter notebooks:

- [Training Notebook](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Finetuning.ipynb): A guide to fine-tuning the ProkBERT-mini model for promoter identification tasks.
- [Evaluation Notebook](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Inference.ipynb): Demonstrates how to evaluate the fine-tuned ProkBERT-mini-promoter model on test datasets.
### Model Application

The model was trained for binary classification to distinguish promoter from non-promoter sequences. The length and composition of the promoter sequences were standardized to ensure compatibility with alternative methods and to allow direct comparison of model performance.
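Length standardization of the kind described above can be sketched as follows. This is a toy helper of our own, not part of the `prokbert` package, and the 80 bp target length and `N` padding character are illustrative assumptions:

```python
def standardize(seq: str, target_len: int = 80, pad_char: str = "N") -> str:
    """Center-crop sequences longer than target_len; N-pad shorter ones."""
    if len(seq) >= target_len:
        start = (len(seq) - target_len) // 2
        return seq[start:start + target_len]
    pad = target_len - len(seq)
    return pad_char * (pad // 2) + seq + pad_char * (pad - pad // 2)

print(len(standardize("ATGC" * 30)))  # 80
```

A fixed input length keeps tokenized sequences comparable across models, which is what makes the head-to-head benchmarks meaningful.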
## Simple Usage Example

The following example demonstrates how to use the ProkBERT-mini-promoter model to process a DNA sequence:

```python
from prokbert.prokbert_tokenizer import ProkBERTTokenizer
from prokbert.models import BertForBinaryClassificationWithPooling

finetuned_model = "neuralbioinfo/prokbert-mini-promoter"
kmer = 6
shift = 1

tok_params = {'kmer': kmer, 'shift': shift}
tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
model = BertForBinaryClassificationWithPooling.from_pretrained(finetuned_model)

sequence = 'TAGCGCATAATGATTTCCTTATAAGCGATCGCTCTGAAAGCGTTCTACGATAATAATGATATCCTTTCAATAATAGCGTAT'
inputs = tokenizer(sequence, return_tensors="pt")
# Ensure that the inputs have a batch dimension
inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
# Generate outputs from the model
outputs = model(**inputs)
print(outputs)
```
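The printed output contains the raw two-class logits. To turn a logit pair into a class probability you can apply a softmax; the sketch below uses made-up logit values rather than real model output, and the (negative, positive) class ordering is an assumption:

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (negative, positive) from a single sequence
logits = [-1.2, 2.3]
probs = softmax(logits)
print(f"P(positive class) = {probs[1]:.3f}")
```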

### Model Details

**Developed by:** Neural Bioinformatics Research Group

**Architecture:**

...

**Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.
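To illustrate what k6s1 means, the toy function below (our own sketch, not the `prokbert` tokenizer; it omits special tokens and vocabulary mapping) slides a 6-character window over the sequence with a step of 1:

```python
def kmerize(sequence: str, kmer: int = 6, shift: int = 1) -> list[str]:
    """Slide a window of length `kmer` over `sequence`, stepping by `shift`."""
    return [sequence[i:i + kmer]
            for i in range(0, len(sequence) - kmer + 1, shift)]

tokens = kmerize("TAGCGCATAA", kmer=6, shift=1)
print(tokens)  # ['TAGCGC', 'AGCGCA', 'GCGCAT', 'CGCATA', 'GCATAA']
```

With shift 1, consecutive tokens overlap by 5 bases, so an L bp sequence yields L - 5 tokens.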
**Parameters:**

| Parameter         | Value                      |
|-------------------|----------------------------|
| Model Size        | 20.6 million parameters    |
| Max. Context Size | 1024 bp                    |
| Training Data     | 206.65 billion nucleotides |
| Layers            | 6                          |
| Attention Heads   | 6                          |
### Intended Use

**Intended Use Cases:** ProkBERT-mini-phage is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:

- Sequence classification tasks
- Exploration of genomic patterns and features
### Installation of ProkBERT (if needed)

To set up ProkBERT in your environment, install it with the following command (if it is not already installed):

```python
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert  # the `!` prefix requires a Jupyter/Colab notebook
    print("Installed ProkBERT.")
```

### Training Data and Process

**Overview:** The model was pretrained on a comprehensive dataset of genomic sequences to ensure broad coverage and robust learning.

*Masking performance of the ProkBERT family.*
### Ethical Considerations and Limitations

As with all models in the bioinformatics domain, ProkBERT-mini-promoter should be used responsibly. Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.

### Reporting Issues

Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following channels:

- **Model issues:** [GitHub repository](https://github.com/nbrg-ppcu/prokbert)
- **Feedback and inquiries:** [[email protected]](mailto:[email protected])
## Reference

If you use ProkBERT-mini in your research, please cite the following paper:

```bibtex
@ARTICLE{10.3389/fmicb.2023.1331233,
    AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
    TITLE={ProkBERT family: genomic language models for microbiome applications},
    JOURNAL={Frontiers in Microbiology},
    VOLUME={14},
    YEAR={2024},
    URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
    DOI={10.3389/fmicb.2023.1331233},
    ISSN={1664-302X},
    ABSTRACT={...}
}
```