Commit 8ee5602 by gbyuvd (1 parent: c8e44c2)

Update README.md

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -389,6 +389,10 @@ The process involved the following steps:
  - Error handling was implemented for cases where molecules couldn't be processed.
  - The labels were resampled to ensure a balanced range of examples ([0,1])
  - The similarity scores were rounded to two decimal places for training.
+ 9. **Splitting**:
+ - The dataset was split into 80% for training, 10% for validation, and 10% for testing.
+ - For the Natural Products set (NP), the test set is labeled "NP-iso-base" because the SMILES strings were not canonicalized (isomeric forms were retained).
+ - The test set from ChemBL34 was then combined with the NP test set, and this combined set is referred to as "combined."
 
  This methodology aims to provide a diverse set of molecule pairs with labels indicating structural similarity. The combination of complexity binning, balanced inter- and intra-strata sampling, and MACCS fingerprint similarity labeling is intended to capture a range of molecular complexities while providing chemically relevant labels for model training.
 
@@ -530,7 +534,8 @@ then query similars based on the average embeddings (returned in 3.5s):
  (WIP)
 
  ## Validation by Docking-based Virtual Screening
- Validation by Docking-based Virtual Screening (DBVS) is finished with promising hit rates, but some of the methods used are adapted from my undergraduate thesis, which is still in progress and pending publication.
+ Validation by Docking-based Virtual Screening (DBVS) is finished, showing promising hit rates ranging from 26% to 58% within the top 100 hits using averaged embeddings of nAChR α4β2 partial agonists, depending on the threshold applied.
+ However, some of the methods used are adapted from my undergraduate thesis, which is still in progress and pending publication.
  Detailed results and methodologies will be fully disclosed after the thesis is published.
 
  ## Testing Generated Embeddings' Clusters
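
Note on the added `Splitting` step: the 80/10/10 split and the "combined" test set it describes could be sketched roughly as below. This is only an illustration, not the author's code. The README does not say how the split was performed, so the shuffle-then-slice approach, the fixed seed, the function name `split_80_10_10`, and the placeholder records are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of `items` and slice it into train/validation/test (80/10/10).

    Hypothetical helper: the seed and the shuffle-then-slice strategy are assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises in the cut points.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 10%)
    return train, val, test


# Placeholder records standing in for (molecule A, molecule B, similarity label);
# these are made-up stand-ins, not real data from either dataset.
chembl_pairs = [(f"chembl_a_{i}", f"chembl_b_{i}", round(i / 99, 2)) for i in range(100)]
np_pairs = [(f"np_a_{i}", f"np_b_{i}", round(i / 49, 2)) for i in range(50)]

chembl_train, chembl_val, chembl_test = split_80_10_10(chembl_pairs)
np_train, np_val, np_test = split_80_10_10(np_pairs)

# The "combined" test set is the ChEMBL test split followed by the NP test split.
combined_test = chembl_test + np_test
```

Each dataset is split separately here so that the NP test split stays available on its own (the commit labels it "NP-iso-base") before the two test splits are concatenated.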