Update README.md
README.md
CHANGED
@@ -389,6 +389,10 @@ The process involved the following steps:
  - Error handling was implemented for cases where molecules couldn't be processed.
  - The labels were resampled to ensure a balanced distribution of examples across the [0, 1] range.
  - The similarity scores were rounded to two decimal places for training.
+ 9. **Splitting**:
+ - The dataset was split into 80% for training, 10% for validation, and 10% for testing.
+ - For the Natural Products set (NP), the test set is labeled "NP-iso-base" because the SMILES strings were not canonicalized (isomeric forms were retained).
+ - The test set from ChEMBL34 was then combined with the NP test set; the combined set is referred to as "combined".

  This methodology aims to provide a diverse set of molecule pairs with labels indicating structural similarity. The combination of complexity binning, balanced inter- and intra-strata sampling, and MACCS fingerprint similarity labeling is intended to capture a range of molecular complexities while providing chemically relevant labels for model training.

@@ -530,7 +534,8 @@ then query similars based on the average embeddings (returned in 3.5s):
  (WIP)

  ## Validation by Docking-based Virtual Screening
- Validation by Docking-based Virtual Screening (DBVS) is finished
+ Validation by Docking-based Virtual Screening (DBVS) is finished and shows promising hit rates, ranging from 26% to 58% within the top 100 hits depending on the threshold applied, using averaged embeddings of nAChR α4β2 partial agonists.
+ However, some of the methods used are adapted from my undergraduate thesis, which is still in progress and pending publication.
  Detailed results and methodologies will be fully disclosed after the thesis is published.

  ## Testing Generated Embeddings' Clusters
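As a rough illustration of the splitting step added in this commit (step 9), the 80/10/10 split and the merging of the two test sets could be sketched as follows. This is only a minimal sketch: the function name `split_80_10_10`, the fixed seed, the use of a plain random shuffle, and the variable names in the usage note are assumptions for illustration, not the author's actual pipeline.

```python
import random


def split_80_10_10(pairs, seed=42):
    """Shuffle a dataset and split it into train/validation/test (80/10/10).

    `pairs` is any sequence of examples. The seed and the shuffle-based
    strategy are assumptions; the README only states the proportions.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point surprises in the cut points.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Under the same assumptions, the "combined" test set would simply be the concatenation of the two test splits, for example `combined_test = chembl_test + np_test`, where both names are hypothetical.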
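The first hunk's context mentions MACCS fingerprint similarity labels rounded to two decimal places. A hedged sketch of that labeling is shown below. It assumes Tanimoto similarity, which the README excerpt does not name explicitly, and it represents each fingerprint as a set of on-bit indices rather than calling a cheminformatics toolkit. The function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit indices.

    Assumption: Tanimoto is the similarity metric; the source only says
    "MACCS fingerprint similarity".
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: define the similarity as 0.0 here.
        return 0.0
    return len(a & b) / len(union)


def similarity_label(bits_a, bits_b):
    """Similarity score rounded to two decimal places, as used for training."""
    return round(tanimoto(bits_a, bits_b), 2)
```

For instance, two fingerprints sharing two of four distinct on-bits, such as `{1, 2, 3}` and `{2, 3, 4}`, would receive a label of `0.5`.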