AmelieSchreiber committed on
Commit
b79663e
•
1 Parent(s): 86046a8

Update README.md

  1. README.md +16 -222
README.md CHANGED
@@ -26,6 +26,20 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
26
  In a couple of weeks, once the transformers library is updated, you should be able to simply use the latest version of transformers
27
  and gradient checkpointing will be fully enabled, and QLoRA compatibility should be fully integrated into ESM-2 models.
28
 
29
  ## QLoRA Info
30
 
31
  Note, we are only training 0.58% of the parameters, using only the query, key, and value weight matrices.
@@ -65,228 +79,8 @@ Test metrics:
65
  'eval_mcc': 0.2535956911257298}
66
  ```
67
 
68
- Metrics for [these datasets](https://github.com/hamzagamouh/pt-lm-gnn):
69
-
70
- ```python
71
- --------------------------------------------------
72
- Processing rows: 100%|██████████| 54/54 [00:04<00:00, 11.49it/s]
73
- Dataset: GTP_Training.txt
74
- Accuracy: 0.8777
75
- Precision: 0.1488
76
- Recall: 0.5517
77
- F1 Score: 0.2344
78
- AUC: 0.7204
79
- MCC: 0.2407
80
- --------------------------------------------------
81
- Processing rows: 100%|██████████| 82/82 [00:02<00:00, 32.07it/s]
82
- Dataset: GDP_Training.txt
83
- Accuracy: 0.8711
84
- Precision: 0.1768
85
- Recall: 0.6022
86
- F1 Score: 0.2733
87
- AUC: 0.7423
88
- MCC: 0.2768
89
- --------------------------------------------------
90
- Processing rows: 100%|██████████| 172/172 [00:06<00:00, 27.98it/s]
91
- Dataset: FE_Training.txt
92
- Accuracy: 0.8424
93
- Precision: 0.0547
94
- Recall: 0.5452
95
- F1 Score: 0.0994
96
- AUC: 0.6962
97
- MCC: 0.1344
98
- --------------------------------------------------
99
- Processing rows: 100%|██████████| 145/145 [00:04<00:00, 33.86it/s]
100
- Dataset: AMP_Training.txt
101
- Accuracy: 0.8191
102
- Precision: 0.0975
103
- Recall: 0.5078
104
- F1 Score: 0.1636
105
- AUC: 0.6691
106
- MCC: 0.1609
107
- --------------------------------------------------
108
- Processing rows: 100%|██████████| 206/206 [00:05<00:00, 34.97it/s]
109
- Dataset: HEME_Training.txt
110
- Accuracy: 0.8561
111
- Precision: 0.2089
112
- Recall: 0.2795
113
- F1 Score: 0.2391
114
- AUC: 0.5932
115
- MCC: 0.1636
116
- --------------------------------------------------
117
- Processing rows: 100%|██████████| 221/221 [00:06<00:00, 31.64it/s]
118
- Dataset: ATP_Training.txt
119
- Accuracy: 0.8631
120
- Precision: 0.1459
121
- Recall: 0.4975
122
- F1 Score: 0.2256
123
- AUC: 0.6879
124
- MCC: 0.2146
125
- --------------------------------------------------
126
- Processing rows: 100%|██████████| 335/335 [00:10<00:00, 33.49it/s]
127
- Dataset: DNA_Training.txt
128
- Accuracy: 0.8387
129
- Precision: 0.1608
130
- Recall: 0.2233
131
- F1 Score: 0.1870
132
- AUC: 0.5589
133
- MCC: 0.1017
134
- --------------------------------------------------
135
- Processing rows: 100%|██████████| 296/296 [00:08<00:00, 32.99it/s]
136
- Dataset: ADP_Training.txt
137
- Accuracy: 0.8653
138
- Precision: 0.1415
139
- Recall: 0.5142
140
- F1 Score: 0.2219
141
- AUC: 0.6966
142
- MCC: 0.2176
143
- --------------------------------------------------
144
- Processing rows: 100%|██████████| 334/334 [00:10<00:00, 31.30it/s]
145
- Dataset: MN_Training.txt
146
- Accuracy: 0.8507
147
- Precision: 0.0488
148
- Recall: 0.5602
149
- F1 Score: 0.0898
150
- AUC: 0.7074
151
- MCC: 0.1320
152
- --------------------------------------------------
153
- Processing rows: 100%|██████████| 1152/1152 [00:36<00:00, 31.70it/s]
154
- Dataset: ZN_Training.txt
155
- Accuracy: 0.8418
156
- Precision: 0.0437
157
- Recall: 0.4674
158
- F1 Score: 0.0799
159
- AUC: 0.6574
160
- MCC: 0.1041
161
- --------------------------------------------------
162
- Processing rows: 100%|██████████| 1131/1131 [00:35<00:00, 31.87it/s]
163
- Dataset: MG_Training.txt
164
- Accuracy: 0.8454
165
- Precision: 0.0327
166
- Recall: 0.4617
167
- F1 Score: 0.0611
168
- AUC: 0.6556
169
- MCC: 0.0896
170
- --------------------------------------------------
171
- Processing rows: 100%|██████████| 961/961 [00:30<00:00, 31.67it/s]
172
- Dataset: CA_Training.txt
173
- Accuracy: 0.8524
174
- Precision: 0.0251
175
- Recall: 0.2057
176
- F1 Score: 0.0447
177
- AUC: 0.5346
178
- MCC: 0.0258
179
- ```
180
- ```python
181
- --------------------------------------------------
182
- Processing rows: 100%|██████████| 27/27 [00:01<00:00, 26.47it/s]
183
- Dataset: HEME_Validation.txt
184
- Accuracy: 0.8891
185
- Precision: 0.2125
186
- Recall: 0.2810
187
- F1 Score: 0.2420
188
- AUC: 0.6055
189
- MCC: 0.1855
190
- --------------------------------------------------
191
- Processing rows: 100%|██████████| 7/7 [00:00<00:00, 20.36it/s]
192
- Dataset: GTP_Validation.txt
193
- Accuracy: 0.8012
194
- Precision: 0.1377
195
- Recall: 0.6404
196
- F1 Score: 0.2266
197
- AUC: 0.7247
198
- MCC: 0.2292
199
- --------------------------------------------------
200
- Processing rows: 100%|██████████| 14/14 [00:00<00:00, 17.96it/s]
201
- Dataset: GDP_Validation.txt
202
- Accuracy: 0.7954
203
- Precision: 0.1456
204
- Recall: 0.7423
205
- F1 Score: 0.2434
206
- AUC: 0.7701
207
- MCC: 0.2658
208
- --------------------------------------------------
209
- Processing rows: 100%|██████████| 26/26 [00:00<00:00, 27.91it/s]
210
- Dataset: FE_Validation.txt
211
- Accuracy: 0.8523
212
- Precision: 0.0571
213
- Recall: 0.6667
214
- F1 Score: 0.1052
215
- AUC: 0.7607
216
- MCC: 0.1646
217
- --------------------------------------------------
218
- Processing rows: 100%|██████████| 58/58 [00:01<00:00, 30.49it/s]
219
- Dataset: MN_Validation.txt
220
- Accuracy: 0.8445
221
- Precision: 0.0458
222
- Recall: 0.5359
223
- F1 Score: 0.0844
224
- AUC: 0.6923
225
- MCC: 0.1216
226
- --------------------------------------------------
227
- Processing rows: 100%|██████████| 33/33 [00:00<00:00, 34.34it/s]
228
- Dataset: AMP_Validation.txt
229
- Accuracy: 0.8116
230
- Precision: 0.1065
231
- Recall: 0.5638
232
- F1 Score: 0.1792
233
- AUC: 0.6924
234
- MCC: 0.1827
235
- --------------------------------------------------
236
- Processing rows: 100%|██████████| 52/52 [00:01<00:00, 32.70it/s]
237
- Dataset: DNA_Validation.txt
238
- Accuracy: 0.8849
239
- Precision: 0.1306
240
- Recall: 0.1829
241
- F1 Score: 0.1524
242
- AUC: 0.5550
243
- MCC: 0.0940
244
- --------------------------------------------------
245
- Processing rows: 100%|██████████| 50/50 [00:01<00:00, 33.79it/s]
246
- Dataset: ATP_Validation.txt
247
- Accuracy: 0.8497
248
- Precision: 0.1220
249
- Recall: 0.4869
250
- F1 Score: 0.1952
251
- AUC: 0.6753
252
- MCC: 0.1868
253
- --------------------------------------------------
254
- Processing rows: 100%|██████████| 47/47 [00:01<00:00, 31.43it/s]
255
- Dataset: ADP_Validation.txt
256
- Accuracy: 0.8652
257
- Precision: 0.1279
258
- Recall: 0.5379
259
- F1 Score: 0.2067
260
- AUC: 0.7071
261
- MCC: 0.2139
262
- --------------------------------------------------
263
- Processing rows: 100%|██████████| 176/176 [00:05<00:00, 32.21it/s]
264
- Dataset: ZN_Validation.txt
265
- Accuracy: 0.8486
266
- Precision: 0.0461
267
- Recall: 0.4516
268
- F1 Score: 0.0837
269
- AUC: 0.6532
270
- MCC: 0.1054
271
- --------------------------------------------------
272
- Processing rows: 100%|██████████| 165/165 [00:05<00:00, 32.32it/s]
273
- Dataset: CA_Validation.txt
274
- Accuracy: 0.8577
275
- Precision: 0.0263
276
- Recall: 0.2471
277
- F1 Score: 0.0476
278
- AUC: 0.5568
279
- MCC: 0.0396
280
- --------------------------------------------------
281
- Processing rows: 100%|██████████| 217/217 [00:06<00:00, 33.25it/s]
282
- Dataset: MG_Validation.txt
283
- Accuracy: 0.8572
284
- Precision: 0.0297
285
- Recall: 0.3533
286
- F1 Score: 0.0547
287
- AUC: 0.6082
288
- MCC: 0.0672
289
- ```
290
 
291
  ### Checkpoint 4
292
 
 
26
  In a couple of weeks, once the transformers library is updated, you should be able to simply use the latest version of transformers
27
  and gradient checkpointing will be fully enabled, and QLoRA compatibility should be fully integrated into ESM-2 models.
28
 
29
+ ## Data Curation and Preprocessing
30
+
31
+ To create your own datasets and perform the same data preprocessing as was used for this project, you will need to download a TSV file
32
+ from UniProt with the following columns (Protein families, Binding sites, Active sites, Protein sequence), and then you can use
33
+ [this notebook](https://huggingface.co/AmelieSchreiber/esm2_t6_8m_qlora_binding_sites_v0/blob/main/data_processing_v1.ipynb) for
34
+ separating out the test sequences by choosing random families to use (including all sequences in that family, with no overlap with
35
+ the training data), filtering out proteins with incomplete annotations, merging the binding and active sites, converting them to binary
36
+ labels (`0` for non-binding sites, `1` for binding sites), and splitting the sequences into non-overlapping chunks of 1000 residues or
37
+ less to accommodate the context window of size 1022 of ESM-2 models. This notebook will also allow you to reduce the size of your dataset
38
+ at the end. Note, this step is not currently ideal as it only selects proteins at random from the train and test datasets to keep and does
39
+ not take into account that proteins from small families are less likely to be chosen, biasing the models towards larger families. Due to
40
+ this shortcoming in our data preprocessing step, smaller models trained on smaller datasets are likely biased towards larger families.
41
+ Perhaps an approach that is biased towards smaller families would be better.
42
+
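The preprocessing steps described in the added section (binary labels from merged site annotations, non-overlapping chunks of at most 1000 residues, and a family-level train/test split) can be sketched roughly as follows. This is an illustrative sketch only, not the code from the linked notebook: the function names, the `"family"` record key, the 1-based site positions, and the `test_fraction` and `seed` parameters are all assumptions made for this example.

```python
import random
from collections import defaultdict

MAX_CHUNK = 1000  # chunk length mentioned in the README text


def sites_to_labels(seq_len, site_positions):
    """Return one 0/1 label per residue (assumes 1-based site positions)."""
    labels = [0] * seq_len
    for pos in site_positions:
        if 1 <= pos <= seq_len:
            labels[pos - 1] = 1
    return labels


def chunk(sequence, labels, size=MAX_CHUNK):
    """Split a sequence and its labels into non-overlapping chunks of at most `size`."""
    return [
        (sequence[i:i + size], labels[i:i + size])
        for i in range(0, len(sequence), size)
    ]


def split_by_family(records, test_fraction=0.2, seed=0):
    """Hold out whole families for testing so no family appears in both splits."""
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)  # "family" key is hypothetical
    families = sorted(by_family)
    random.Random(seed).shuffle(families)
    n_test = max(1, round(len(families) * test_fraction))
    test = [r for f in families[:n_test] for r in by_family[f]]
    train = [r for f in families[n_test:] for r in by_family[f]]
    return train, test
```

Merging binding and active sites before labelling amounts to taking the union of the two position sets, for example `sites_to_labels(len(seq), set(binding) | set(active))`. Note that a purely random family choice like this one does nothing to correct the bias towards larger families that the section describes.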
43
  ## QLoRA Info
44
 
45
  Note, we are only training 0.58% of the parameters, using only the query, key, and value weight matrices.
 
79
  'eval_mcc': 0.2535956911257298}
80
  ```
81
 
82
+ Metrics for this checkpoint on [these datasets](https://github.com/hamzagamouh/pt-lm-gnn) can be
83
+ [found here](https://huggingface.co/AmelieSchreiber/esm2_t6_8m_qlora_binding_sites_v0/blob/main/pdb_struct_metrics.txt).
84
 
85
  ### Checkpoint 4
86