AmelieSchreiber
/

esm2_t6_8m_qlora_binding_sites_v0

@@ -26,6 +26,20 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 In a couple of weeks, once the transformers library is updated, you should be able to simply use the latest version of transformers
 and gradient checkpointing will be fully enabled, and QLoRA compatibility should be fully integrated into ESM-2 models.
 ## QLoRA Info
 Note, we are only training 0.58% of the parameters, using only the query, key, and value weight matrices.
@@ -65,228 +79,8 @@ Test metrics:
 'eval_mcc': 0.2535956911257298}
 ```
-Metrics for [these datasets](https://github.com/hamzagamouh/pt-lm-gnn):
-```python
---------------------------------------------------
-Processing rows: 100%|██████████| 54/54 [00:04<00:00, 11.49it/s]
-Dataset: GTP_Training.txt
-Accuracy: 0.8777
-Precision: 0.1488
-Recall: 0.5517
-F1 Score: 0.2344
-AUC: 0.7204
-MCC: 0.2407
---------------------------------------------------
-Processing rows: 100%|██████████| 82/82 [00:02<00:00, 32.07it/s]
-Dataset: GDP_Training.txt
-Accuracy: 0.8711
-Precision: 0.1768
-Recall: 0.6022
-F1 Score: 0.2733
-AUC: 0.7423
-MCC: 0.2768
---------------------------------------------------
-Processing rows: 100%|██████████| 172/172 [00:06<00:00, 27.98it/s]
-Dataset: FE_Training.txt
-Accuracy: 0.8424
-Precision: 0.0547
-Recall: 0.5452
-F1 Score: 0.0994
-AUC: 0.6962
-MCC: 0.1344
---------------------------------------------------
-Processing rows: 100%|██████████| 145/145 [00:04<00:00, 33.86it/s]
-Dataset: AMP_Training.txt
-Accuracy: 0.8191
-Precision: 0.0975
-Recall: 0.5078
-F1 Score: 0.1636
-AUC: 0.6691
-MCC: 0.1609
---------------------------------------------------
-Processing rows: 100%|██████████| 206/206 [00:05<00:00, 34.97it/s]
-Dataset: HEME_Training.txt
-Accuracy: 0.8561
-Precision: 0.2089
-Recall: 0.2795
-F1 Score: 0.2391
-AUC: 0.5932
-MCC: 0.1636
---------------------------------------------------
-Processing rows: 100%|██████████| 221/221 [00:06<00:00, 31.64it/s]
-Dataset: ATP_Training.txt
-Accuracy: 0.8631
-Precision: 0.1459
-Recall: 0.4975
-F1 Score: 0.2256
-AUC: 0.6879
-MCC: 0.2146
---------------------------------------------------
-Processing rows: 100%|██████████| 335/335 [00:10<00:00, 33.49it/s]
-Dataset: DNA_Training.txt
-Accuracy: 0.8387
-Precision: 0.1608
-Recall: 0.2233
-F1 Score: 0.1870
-AUC: 0.5589
-MCC: 0.1017
---------------------------------------------------
-Processing rows: 100%|██████████| 296/296 [00:08<00:00, 32.99it/s]
-Dataset: ADP_Training.txt
-Accuracy: 0.8653
-Precision: 0.1415
-Recall: 0.5142
-F1 Score: 0.2219
-AUC: 0.6966
-MCC: 0.2176
---------------------------------------------------
-Processing rows: 100%|██████████| 334/334 [00:10<00:00, 31.30it/s]
-Dataset: MN_Training.txt
-Accuracy: 0.8507
-Precision: 0.0488
-Recall: 0.5602
-F1 Score: 0.0898
-AUC: 0.7074
-MCC: 0.1320
---------------------------------------------------
-Processing rows: 100%|██████████| 1152/1152 [00:36<00:00, 31.70it/s]
-Dataset: ZN_Training.txt
-Accuracy: 0.8418
-Precision: 0.0437
-Recall: 0.4674
-F1 Score: 0.0799
-AUC: 0.6574
-MCC: 0.1041
---------------------------------------------------
-Processing rows: 100%|██████████| 1131/1131 [00:35<00:00, 31.87it/s]
-Dataset: MG_Training.txt
-Accuracy: 0.8454
-Precision: 0.0327
-Recall: 0.4617
-F1 Score: 0.0611
-AUC: 0.6556
-MCC: 0.0896
---------------------------------------------------
-Processing rows: 100%|██████████| 961/961 [00:30<00:00, 31.67it/s]
-Dataset: CA_Training.txt
-Accuracy: 0.8524
-Precision: 0.0251
-Recall: 0.2057
-F1 Score: 0.0447
-AUC: 0.5346
-MCC: 0.0258
-```
-```python
---------------------------------------------------
-Processing rows: 100%|██████████| 27/27 [00:01<00:00, 26.47it/s]
-Dataset: HEME_Validation.txt
-Accuracy: 0.8891
-Precision: 0.2125
-Recall: 0.2810
-F1 Score: 0.2420
-AUC: 0.6055
-MCC: 0.1855
---------------------------------------------------
-Processing rows: 100%|██████████| 7/7 [00:00<00:00, 20.36it/s]
-Dataset: GTP_Validation.txt
-Accuracy: 0.8012
-Precision: 0.1377
-Recall: 0.6404
-F1 Score: 0.2266
-AUC: 0.7247
-MCC: 0.2292
---------------------------------------------------
-Processing rows: 100%|██████████| 14/14 [00:00<00:00, 17.96it/s]
-Dataset: GDP_Validation.txt
-Accuracy: 0.7954
-Precision: 0.1456
-Recall: 0.7423
-F1 Score: 0.2434
-AUC: 0.7701
-MCC: 0.2658
---------------------------------------------------
-Processing rows: 100%|██████████| 26/26 [00:00<00:00, 27.91it/s]
-Dataset: FE_Validation.txt
-Accuracy: 0.8523
-Precision: 0.0571
-Recall: 0.6667
-F1 Score: 0.1052
-AUC: 0.7607
-MCC: 0.1646
---------------------------------------------------
-Processing rows: 100%|██████████| 58/58 [00:01<00:00, 30.49it/s]
-Dataset: MN_Validation.txt
-Accuracy: 0.8445
-Precision: 0.0458
-Recall: 0.5359
-F1 Score: 0.0844
-AUC: 0.6923
-MCC: 0.1216
---------------------------------------------------
-Processing rows: 100%|██████████| 33/33 [00:00<00:00, 34.34it/s]
-Dataset: AMP_Validation.txt
-Accuracy: 0.8116
-Precision: 0.1065
-Recall: 0.5638
-F1 Score: 0.1792
-AUC: 0.6924
-MCC: 0.1827
---------------------------------------------------
-Processing rows: 100%|██████████| 52/52 [00:01<00:00, 32.70it/s]
-Dataset: DNA_Validation.txt
-Accuracy: 0.8849
-Precision: 0.1306
-Recall: 0.1829
-F1 Score: 0.1524
-AUC: 0.5550
-MCC: 0.0940
---------------------------------------------------
-Processing rows: 100%|██████████| 50/50 [00:01<00:00, 33.79it/s]
-Dataset: ATP_Validation.txt
-Accuracy: 0.8497
-Precision: 0.1220
-Recall: 0.4869
-F1 Score: 0.1952
-AUC: 0.6753
-MCC: 0.1868
---------------------------------------------------
-Processing rows: 100%|██████████| 47/47 [00:01<00:00, 31.43it/s]
-Dataset: ADP_Validation.txt
-Accuracy: 0.8652
-Precision: 0.1279
-Recall: 0.5379
-F1 Score: 0.2067
-AUC: 0.7071
-MCC: 0.2139
---------------------------------------------------
-Processing rows: 100%|██████████| 176/176 [00:05<00:00, 32.21it/s]
-Dataset: ZN_Validation.txt
-Accuracy: 0.8486
-Precision: 0.0461
-Recall: 0.4516
-F1 Score: 0.0837
-AUC: 0.6532
-MCC: 0.1054
---------------------------------------------------
-Processing rows: 100%|██████████| 165/165 [00:05<00:00, 32.32it/s]
-Dataset: CA_Validation.txt
-Accuracy: 0.8577
-Precision: 0.0263
-Recall: 0.2471
-F1 Score: 0.0476
-AUC: 0.5568
-MCC: 0.0396
---------------------------------------------------
-Processing rows: 100%|██████████| 217/217 [00:06<00:00, 33.25it/s]
-Dataset: MG_Validation.txt
-Accuracy: 0.8572
-Precision: 0.0297
-Recall: 0.3533
-F1 Score: 0.0547
-AUC: 0.6082
-MCC: 0.0672
-```
 ### Checkpoint 4

 In a couple of weeks, once the transformers library is updated, you should be able to simply use the latest version of transformers
 and gradient checkpointing will be fully enabled, and QLoRA compatibility should be fully integrated into ESM-2 models.
+## Data Curation and Preprocessing
+To build your own datasets with the same preprocessing used for this project, download a TSV file from UniProt with the following
+columns: Protein families, Binding sites, Active sites, and Protein sequence. Then use
+[this notebook](https://huggingface.co/AmelieSchreiber/esm2_t6_8m_qlora_binding_sites_v0/blob/main/data_processing_v1.ipynb) to
+separate out the test sequences by choosing random families (each chosen family contributes all of its sequences to the test set, with
+no overlap with the training data), filter out proteins with incomplete annotations, merge the binding and active sites, convert them
+to binary labels (`0` for non-binding sites, `1` for binding sites), and split the sequences into non-overlapping chunks of 1000
+residues or fewer to accommodate the context window of ESM-2 models (size 1022). The notebook also lets you reduce the size of your
+dataset at the end. Note that this reduction step is not currently ideal: it only selects proteins at random from the train and test
+datasets to keep, so proteins from small families are less likely to be chosen. Because of this shortcoming, smaller models trained on
+smaller datasets are likely biased towards larger families. Perhaps an approach that is biased towards smaller families would be better.
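The labelling and chunking steps described above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked notebook: the function names `sites_to_binary_labels` and `split_into_chunks` are hypothetical, and it assumes site positions are 1-based residue indices that have already been parsed out of the UniProt annotation columns.

```python
from typing import Iterable, List, Tuple

MAX_CHUNK_LEN = 1000  # chunk size quoted in the text above


def sites_to_binary_labels(seq_len: int, site_positions: Iterable[int]) -> List[int]:
    """Return one label per residue: 1 for a binding/active site, 0 otherwise.

    Assumption: positions are 1-based; out-of-range positions are ignored.
    """
    labels = [0] * seq_len
    for pos in site_positions:
        if 1 <= pos <= seq_len:
            labels[pos - 1] = 1
    return labels


def split_into_chunks(
    sequence: str, labels: List[int], max_len: int = MAX_CHUNK_LEN
) -> List[Tuple[str, List[int]]]:
    """Split a sequence and its labels into aligned, non-overlapping chunks."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    return [
        (sequence[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(sequence), max_len)
    ]


# Toy example: a 2500-residue sequence with made-up site positions.
sequence = "M" * 2500
binding_sites = [1, 1500]
active_sites = [2500]

# "Merging" binding and active sites is taken here to mean a set union.
merged_sites = set(binding_sites) | set(active_sites)
labels = sites_to_binary_labels(len(sequence), merged_sites)
chunks = split_into_chunks(sequence, labels)
chunk_lengths = [len(chunk_seq) for chunk_seq, _ in chunks]
```

Each chunk keeps its own slice of the label list, so sequence and labels stay aligned residue by residue after splitting.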
 ## QLoRA Info
 Note, we are only training 0.58% of the parameters, using only the query, key, and value weight matrices.
 'eval_mcc': 0.2535956911257298}
 ```
+Metrics for this checkpoint on [these datasets](https://github.com/hamzagamouh/pt-lm-gnn) can be
+[found here](https://huggingface.co/AmelieSchreiber/esm2_t6_8m_qlora_binding_sites_v0/blob/main/pdb_struct_metrics.txt).
 ### Checkpoint 4