yerevann committed on
Commit
45f7adb
1 Parent(s): 9a726ae

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -8,7 +8,7 @@ tags:
  - biology
  ---
  Chemma-2B is a continually pretrained [gemma-2b](https://huggingface.co/google/gemma-2b) model for organic molecules.
- It is pretrained on (soon-to-be-released) 40B tokens covering 110M+ molecules from PubChem as well as their chemical properties
+ It is pretrained on [40B tokens covering 110M+ molecules from PubChem](https://huggingface.co/datasets/yerevann/PubChemForLM) as well as their chemical properties
  (molecular weight, synthetic accessibility score, drug-likeness etc.)
  and similarities (Tanimoto distance between ECFP fingerprints).
 
@@ -19,10 +19,10 @@ Example prompts:
  `</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]` will attempt to generate a molecule that has 2.25 SAS score and
  has a 0.62 similarity score to the given molecule.
 
- The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts.
+ The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts. See the [code on GitHub](https://github.com/YerevaNN/ChemLactica).
 
- A preprint with the details of the model and an optimization algorithm built on top of this model that sets state-of-the-art on Practical Molecular Optimization
- and other benchmarks will be released soon.
+ A preprint with the details of the model and an optimization algorithm built on top of this model that sets state-of-the-art on
+ Practical Molecular Optimization and other benchmarks is [available on arxiv](https://arxiv.org/abs/2407.18897).
 
  Few notes:
  * All queries should start with `</s>` symbol.
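
The example prompt in this README follows a simple tag layout, so it can be assembled programmatically. The snippet below is a minimal sketch of that idea and is not taken from the ChemLactica code. The helper name `build_prompt`, its parameters, the two-decimal number formatting, and the tag order are all assumptions inferred from the single example prompt shown in the diff.

```python
# Hypothetical helper: not part of the ChemLactica repository.
def build_prompt(sas=None, similar=None):
    """Assemble a prompt string in the style of the README example.

    sas: target synthetic accessibility score, or None to omit the tag.
    similar: optional (smiles, similarity) pair, or None to omit the tag.
    """
    # The README notes that all queries should start with the </s> symbol.
    parts = ["</s>"]
    if sas is not None:
        # Assumption: values are written with two decimals, as in the example.
        parts.append(f"[SAS]{sas:.2f}[/SAS]")
    if similar is not None:
        smiles, score = similar
        parts.append(f"[SIMILAR]{score:.2f} {smiles}[/SIMILAR]")
    # The prompt ends with an open tag so the model continues with a SMILES string.
    parts.append("[START_SMILES]")
    return "".join(parts)


prompt = build_prompt(sas=2.25, similar=("CC(=O)OC1=CC=CC=C1C(=O)O", 0.62))
print(prompt)
```

With these inputs the helper reproduces the example prompt from the README. An optimization loop of the kind the README mentions could call such a helper repeatedly with updated targets and reference molecules, but the actual procedure is described in the linked code and preprint rather than here.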