AmelieSchreiber
committed on
Commit
•
b79663e
1
Parent(s):
86046a8
Update README.md
README.md
CHANGED
@@ -26,6 +26,20 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
|
|
26 |
In a couple of weeks, once the transformers library is updated, you should be able to simply use the latest version of transformers
|
27 |
and gradient checkpointing will be fully enabled, and QLoRA compatibility should be fully integrated into ESM-2 models.
|
28 |
|
29 |
## QLoRA Info
|
30 |
|
31 |
Note, we are only training 0.58% of the parameters, using only the query, key, and value weight matrices.
|
@@ -65,228 +79,8 @@ Test metrics:
|
|
65 |
'eval_mcc': 0.2535956911257298}
|
66 |
```
|
67 |
|
68 |
-
Metrics for [these datasets](https://github.com/hamzagamouh/pt-lm-gnn)
|
69 |
-
|
70 |
-
```python
|
71 |
-
--------------------------------------------------
|
72 |
-
Processing rows: 100%|██████████| 54/54 [00:04<00:00, 11.49it/s]
|
73 |
-
Dataset: GTP_Training.txt
|
74 |
-
Accuracy: 0.8777
|
75 |
-
Precision: 0.1488
|
76 |
-
Recall: 0.5517
|
77 |
-
F1 Score: 0.2344
|
78 |
-
AUC: 0.7204
|
79 |
-
MCC: 0.2407
|
80 |
-
--------------------------------------------------
|
81 |
-
Processing rows: 100%|██████████| 82/82 [00:02<00:00, 32.07it/s]
|
82 |
-
Dataset: GDP_Training.txt
|
83 |
-
Accuracy: 0.8711
|
84 |
-
Precision: 0.1768
|
85 |
-
Recall: 0.6022
|
86 |
-
F1 Score: 0.2733
|
87 |
-
AUC: 0.7423
|
88 |
-
MCC: 0.2768
|
89 |
-
--------------------------------------------------
|
90 |
-
Processing rows: 100%|██████████| 172/172 [00:06<00:00, 27.98it/s]
|
91 |
-
Dataset: FE_Training.txt
|
92 |
-
Accuracy: 0.8424
|
93 |
-
Precision: 0.0547
|
94 |
-
Recall: 0.5452
|
95 |
-
F1 Score: 0.0994
|
96 |
-
AUC: 0.6962
|
97 |
-
MCC: 0.1344
|
98 |
-
--------------------------------------------------
|
99 |
-
Processing rows: 100%|██████████| 145/145 [00:04<00:00, 33.86it/s]
|
100 |
-
Dataset: AMP_Training.txt
|
101 |
-
Accuracy: 0.8191
|
102 |
-
Precision: 0.0975
|
103 |
-
Recall: 0.5078
|
104 |
-
F1 Score: 0.1636
|
105 |
-
AUC: 0.6691
|
106 |
-
MCC: 0.1609
|
107 |
-
--------------------------------------------------
|
108 |
-
Processing rows: 100%|██████████| 206/206 [00:05<00:00, 34.97it/s]
|
109 |
-
Dataset: HEME_Training.txt
|
110 |
-
Accuracy: 0.8561
|
111 |
-
Precision: 0.2089
|
112 |
-
Recall: 0.2795
|
113 |
-
F1 Score: 0.2391
|
114 |
-
AUC: 0.5932
|
115 |
-
MCC: 0.1636
|
116 |
-
--------------------------------------------------
|
117 |
-
Processing rows: 100%|██████████| 221/221 [00:06<00:00, 31.64it/s]
|
118 |
-
Dataset: ATP_Training.txt
|
119 |
-
Accuracy: 0.8631
|
120 |
-
Precision: 0.1459
|
121 |
-
Recall: 0.4975
|
122 |
-
F1 Score: 0.2256
|
123 |
-
AUC: 0.6879
|
124 |
-
MCC: 0.2146
|
125 |
-
--------------------------------------------------
|
126 |
-
Processing rows: 100%|██████████| 335/335 [00:10<00:00, 33.49it/s]
|
127 |
-
Dataset: DNA_Training.txt
|
128 |
-
Accuracy: 0.8387
|
129 |
-
Precision: 0.1608
|
130 |
-
Recall: 0.2233
|
131 |
-
F1 Score: 0.1870
|
132 |
-
AUC: 0.5589
|
133 |
-
MCC: 0.1017
|
134 |
-
--------------------------------------------------
|
135 |
-
Processing rows: 100%|██████████| 296/296 [00:08<00:00, 32.99it/s]
|
136 |
-
Dataset: ADP_Training.txt
|
137 |
-
Accuracy: 0.8653
|
138 |
-
Precision: 0.1415
|
139 |
-
Recall: 0.5142
|
140 |
-
F1 Score: 0.2219
|
141 |
-
AUC: 0.6966
|
142 |
-
MCC: 0.2176
|
143 |
-
--------------------------------------------------
|
144 |
-
Processing rows: 100%|██████████| 334/334 [00:10<00:00, 31.30it/s]
|
145 |
-
Dataset: MN_Training.txt
|
146 |
-
Accuracy: 0.8507
|
147 |
-
Precision: 0.0488
|
148 |
-
Recall: 0.5602
|
149 |
-
F1 Score: 0.0898
|
150 |
-
AUC: 0.7074
|
151 |
-
MCC: 0.1320
|
152 |
-
--------------------------------------------------
|
153 |
-
Processing rows: 100%|██████████| 1152/1152 [00:36<00:00, 31.70it/s]
|
154 |
-
Dataset: ZN_Training.txt
|
155 |
-
Accuracy: 0.8418
|
156 |
-
Precision: 0.0437
|
157 |
-
Recall: 0.4674
|
158 |
-
F1 Score: 0.0799
|
159 |
-
AUC: 0.6574
|
160 |
-
MCC: 0.1041
|
161 |
-
--------------------------------------------------
|
162 |
-
Processing rows: 100%|██████████| 1131/1131 [00:35<00:00, 31.87it/s]
|
163 |
-
Dataset: MG_Training.txt
|
164 |
-
Accuracy: 0.8454
|
165 |
-
Precision: 0.0327
|
166 |
-
Recall: 0.4617
|
167 |
-
F1 Score: 0.0611
|
168 |
-
AUC: 0.6556
|
169 |
-
MCC: 0.0896
|
170 |
-
--------------------------------------------------
|
171 |
-
Processing rows: 100%|██████████| 961/961 [00:30<00:00, 31.67it/s]
|
172 |
-
Dataset: CA_Training.txt
|
173 |
-
Accuracy: 0.8524
|
174 |
-
Precision: 0.0251
|
175 |
-
Recall: 0.2057
|
176 |
-
F1 Score: 0.0447
|
177 |
-
AUC: 0.5346
|
178 |
-
MCC: 0.0258
|
179 |
-
```
|
180 |
-
```python
|
181 |
-
--------------------------------------------------
|
182 |
-
Processing rows: 100%|██████████| 27/27 [00:01<00:00, 26.47it/s]
|
183 |
-
Dataset: HEME_Validation.txt
|
184 |
-
Accuracy: 0.8891
|
185 |
-
Precision: 0.2125
|
186 |
-
Recall: 0.2810
|
187 |
-
F1 Score: 0.2420
|
188 |
-
AUC: 0.6055
|
189 |
-
MCC: 0.1855
|
190 |
-
--------------------------------------------------
|
191 |
-
Processing rows: 100%|██████████| 7/7 [00:00<00:00, 20.36it/s]
|
192 |
-
Dataset: GTP_Validation.txt
|
193 |
-
Accuracy: 0.8012
|
194 |
-
Precision: 0.1377
|
195 |
-
Recall: 0.6404
|
196 |
-
F1 Score: 0.2266
|
197 |
-
AUC: 0.7247
|
198 |
-
MCC: 0.2292
|
199 |
-
--------------------------------------------------
|
200 |
-
Processing rows: 100%|██████████| 14/14 [00:00<00:00, 17.96it/s]
|
201 |
-
Dataset: GDP_Validation.txt
|
202 |
-
Accuracy: 0.7954
|
203 |
-
Precision: 0.1456
|
204 |
-
Recall: 0.7423
|
205 |
-
F1 Score: 0.2434
|
206 |
-
AUC: 0.7701
|
207 |
-
MCC: 0.2658
|
208 |
-
--------------------------------------------------
|
209 |
-
Processing rows: 100%|██████████| 26/26 [00:00<00:00, 27.91it/s]
|
210 |
-
Dataset: FE_Validation.txt
|
211 |
-
Accuracy: 0.8523
|
212 |
-
Precision: 0.0571
|
213 |
-
Recall: 0.6667
|
214 |
-
F1 Score: 0.1052
|
215 |
-
AUC: 0.7607
|
216 |
-
MCC: 0.1646
|
217 |
-
--------------------------------------------------
|
218 |
-
Processing rows: 100%|██████████| 58/58 [00:01<00:00, 30.49it/s]
|
219 |
-
Dataset: MN_Validation.txt
|
220 |
-
Accuracy: 0.8445
|
221 |
-
Precision: 0.0458
|
222 |
-
Recall: 0.5359
|
223 |
-
F1 Score: 0.0844
|
224 |
-
AUC: 0.6923
|
225 |
-
MCC: 0.1216
|
226 |
-
--------------------------------------------------
|
227 |
-
Processing rows: 100%|██████████| 33/33 [00:00<00:00, 34.34it/s]
|
228 |
-
Dataset: AMP_Validation.txt
|
229 |
-
Accuracy: 0.8116
|
230 |
-
Precision: 0.1065
|
231 |
-
Recall: 0.5638
|
232 |
-
F1 Score: 0.1792
|
233 |
-
AUC: 0.6924
|
234 |
-
MCC: 0.1827
|
235 |
-
--------------------------------------------------
|
236 |
-
Processing rows: 100%|██████████| 52/52 [00:01<00:00, 32.70it/s]
|
237 |
-
Dataset: DNA_Validation.txt
|
238 |
-
Accuracy: 0.8849
|
239 |
-
Precision: 0.1306
|
240 |
-
Recall: 0.1829
|
241 |
-
F1 Score: 0.1524
|
242 |
-
AUC: 0.5550
|
243 |
-
MCC: 0.0940
|
244 |
-
--------------------------------------------------
|
245 |
-
Processing rows: 100%|██████████| 50/50 [00:01<00:00, 33.79it/s]
|
246 |
-
Dataset: ATP_Validation.txt
|
247 |
-
Accuracy: 0.8497
|
248 |
-
Precision: 0.1220
|
249 |
-
Recall: 0.4869
|
250 |
-
F1 Score: 0.1952
|
251 |
-
AUC: 0.6753
|
252 |
-
MCC: 0.1868
|
253 |
-
--------------------------------------------------
|
254 |
-
Processing rows: 100%|██████████| 47/47 [00:01<00:00, 31.43it/s]
|
255 |
-
Dataset: ADP_Validation.txt
|
256 |
-
Accuracy: 0.8652
|
257 |
-
Precision: 0.1279
|
258 |
-
Recall: 0.5379
|
259 |
-
F1 Score: 0.2067
|
260 |
-
AUC: 0.7071
|
261 |
-
MCC: 0.2139
|
262 |
-
--------------------------------------------------
|
263 |
-
Processing rows: 100%|██████████| 176/176 [00:05<00:00, 32.21it/s]
|
264 |
-
Dataset: ZN_Validation.txt
|
265 |
-
Accuracy: 0.8486
|
266 |
-
Precision: 0.0461
|
267 |
-
Recall: 0.4516
|
268 |
-
F1 Score: 0.0837
|
269 |
-
AUC: 0.6532
|
270 |
-
MCC: 0.1054
|
271 |
-
--------------------------------------------------
|
272 |
-
Processing rows: 100%|██████████| 165/165 [00:05<00:00, 32.32it/s]
|
273 |
-
Dataset: CA_Validation.txt
|
274 |
-
Accuracy: 0.8577
|
275 |
-
Precision: 0.0263
|
276 |
-
Recall: 0.2471
|
277 |
-
F1 Score: 0.0476
|
278 |
-
AUC: 0.5568
|
279 |
-
MCC: 0.0396
|
280 |
-
--------------------------------------------------
|
281 |
-
Processing rows: 100%|██████████| 217/217 [00:06<00:00, 33.25it/s]
|
282 |
-
Dataset: MG_Validation.txt
|
283 |
-
Accuracy: 0.8572
|
284 |
-
Precision: 0.0297
|
285 |
-
Recall: 0.3533
|
286 |
-
F1 Score: 0.0547
|
287 |
-
AUC: 0.6082
|
288 |
-
MCC: 0.0672
|
289 |
-
```
|
290 |
|
291 |
### Checkpoint 4
|
292 |
|
|
|
26 |
In a couple of weeks, once the transformers library is updated, you should be able to simply use the latest version of transformers
|
27 |
and gradient checkpointing will be fully enabled, and QLoRA compatibility should be fully integrated into ESM-2 models.
|
28 |
|
29 |
+
## Data Curation and Preprocessing
|
30 |
+
|
31 |
+
To create your own datasets and perform the same data preprocessing as was used for this project, you will need to download a TSV file
|
32 |
+
from UniProt with the following columns (Protein families, Binding sites, Active sites, Protein sequence), and then you can use
|
33 |
+
[this notebook](https://huggingface.co/AmelieSchreiber/esm2_t6_8m_qlora_binding_sites_v0/blob/main/data_processing_v1.ipynb) for
|
34 |
+
separating out the test sequences by choosing random families to use (including all sequences in that family, with no overlap with
|
35 |
+
the training data), filtering out proteins with incomplete annotations, merging the binding and active sites, converting them to binary
|
36 |
+
labels (`0` for non-binding sites, `1` for binding sites), and splitting the sequences into non-overlapping chunks of 1000 residues or
|
37 |
+
less to accommodate the 1022-sized context window of ESM-2 models. This notebook will also allow you to reduce the size of your dataset
|
38 |
+
at the end. Note, this step is not currently ideal as it only selects proteins at random from the train and test datasets to keep and does
|
39 |
+
not take into account that proteins from small families are less likely to be chosen, biasing the models towards larger families. Due to
|
40 |
+
this shortcoming in our data preprocessing step, smaller models trained on smaller datasets are likely biased towards larger families.
|
41 |
+
Perhaps an approach that is biased towards smaller families would be better.
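
The preprocessing steps described above (holding out whole families, merging binding and active sites into binary labels, and chunking to 1000 residues or less) can be sketched roughly as follows. This is a minimal illustration and not the code from the linked notebook: the function names, the `family` record key, the `test_fraction` default, and the assumption that site positions are 1-based are all hypothetical.

```python
import random
from collections import defaultdict


def binary_labels(seq_len, site_positions):
    """Return one 0/1 label per residue (1 = binding or active site).

    Assumes 1-based residue positions; positions outside the sequence are ignored.
    """
    labels = [0] * seq_len
    for pos in site_positions:
        if 1 <= pos <= seq_len:
            labels[pos - 1] = 1
    return labels


def chunk(sequence, labels, max_len=1000):
    """Split a sequence and its labels into non-overlapping chunks of at most max_len."""
    return [
        (sequence[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(sequence), max_len)
    ]


def split_by_family(records, test_fraction=0.2, seed=0):
    """Hold out whole families so that no family appears in both train and test.

    `records` is assumed to be a list of dicts with a hypothetical "family" key.
    """
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)
    families = sorted(by_family)
    random.Random(seed).shuffle(families)
    n_test = max(1, int(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [r for f in families if f not in test_families for r in by_family[f]]
    test = [r for f in families if f in test_families for r in by_family[f]]
    return train, test


# Merging binding and active sites is a set union of their positions.
merged_sites = {2, 5} | {5, 7}
print(binary_labels(8, merged_sites))  # [0, 1, 0, 0, 1, 0, 1, 0]
```

Because `split_by_family` samples families rather than individual proteins, every sequence of a held-out family lands in the test set, which is what prevents family overlap with the training data.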
|
42 |
+
|
43 |
## QLoRA Info
|
44 |
|
45 |
Note, we are only training 0.58% of the parameters, using only the query, key, and value weight matrices.
|
|
|
79 |
'eval_mcc': 0.2535956911257298}
|
80 |
```
|
81 |
|
82 |
+
Metrics for this checkpoint on [these datasets](https://github.com/hamzagamouh/pt-lm-gnn) can be
|
83 |
+
[found here](https://huggingface.co/AmelieSchreiber/esm2_t6_8m_qlora_binding_sites_v0/blob/main/pdb_struct_metrics.txt).
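
For readers who want to interpret such per-residue metrics, here is a minimal sketch of the standard definitions of accuracy, precision, recall, F1, and MCC for binary labels. It is an assumption that the linked file reports these standard quantities; the function name `binary_metrics` is hypothetical, and AUC is omitted because it needs predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute standard binary classification metrics from 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Toy example with few positive residues, as is typical for binding sites.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

With imbalanced labels like these, accuracy can stay high while precision is low, which is why F1 and MCC are usually the more informative numbers for binding site prediction.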
|
84 |
|
85 |
### Checkpoint 4
|
86 |
|