yerevann committed on
Commit
9a726ae
1 Parent(s): 94b80a6

Update README.md

Files changed (1)
  1. README.md +36 -3
README.md CHANGED
@@ -1,3 +1,36 @@
- ---
- license: gemma
- ---
+ ---
+ license: cc-by-nc-4.0
+ language:
+ - en
+ library_name: transformers
+ tags:
+ - chemistry
+ - biology
+ ---
+ Chemma-2B is a continually pretrained [gemma-2b](https://huggingface.co/google/gemma-2b) model for organic molecules.
+ It is pretrained on 40B tokens (soon to be released) covering 110M+ molecules from PubChem, along with their chemical properties
+ (molecular weight, synthetic accessibility score, drug-likeness, etc.)
+ and similarities (Tanimoto similarity between ECFP fingerprints).
+
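For readers unfamiliar with the similarity measure mentioned above, here is a minimal sketch of the Tanimoto coefficient on two sets of "on" fingerprint bits. The bit indices are hypothetical stand-ins: computing real ECFP fingerprints needs a cheminformatics toolkit such as `rdkit`, and the fingerprint settings used for the training data are not stated in this README.

```python
def tanimoto_similarity(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) coefficient of two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical bit indices, standing in for real ECFP fingerprints.
fp_query = {3, 17, 42, 128, 256}
fp_candidate = {3, 17, 42, 512}
print(round(tanimoto_similarity(fp_query, fp_candidate), 2))  # 3 shared / 6 in union = 0.5
```

A value of 1.0 means the two fingerprints are identical, and 0.0 means they share no bits.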
+ Example prompts:
+
+ `</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]` will attempt to predict the synthetic accessibility score (SAS) of the given molecule.
+
+ `</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]` will attempt to generate a molecule that has an SAS of 2.25 and
+ a similarity score of 0.62 to the given molecule.
+
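The prompts above can be sent to the model through `transformers`. What follows is a rough sketch, not an official usage recipe: greedy decoding, the token budget, and the default tokenizer settings (including whether an extra beginning-of-sequence token gets prepended) are all assumptions, and the helper names `complete` and `extract_tag_value` are made up for this example.

```python
import re

MODEL_ID = "yerevann/chemma-2b"  # repository linked from this README


def complete(prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy completion of a prompt. Needs `transformers` and `torch` installed."""
    # Imported lazily so the parsing helper below works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep special tokens in the decoded text, in case the tags are registered as such.
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)


def extract_tag_value(text: str, tag: str):
    """Return the number that follows the first `[TAG]` in `text`, or None."""
    match = re.search(r"\[" + re.escape(tag) + r"\]\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None
```

With these in place, `extract_tag_value(complete(prompt), "SAS")` would pull the predicted score out of the completion of the first example prompt, provided the model answers in the `[SAS]value[/SAS]` form that the second example suggests.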
+ The model can be wrapped in an optimization loop that traverses chemical space with evolving prompts.
+
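To make the idea of an optimization loop concrete, here is a deliberately generic sketch. It is not the algorithm from the forthcoming preprint: the pool-based greedy strategy, the function name `optimize`, its parameters, and the choice to condition only on `[SIMILAR]` are all assumptions made for illustration. The `generate` and `score` callables are supplied by the caller, for example a wrapper around the model and a property oracle.

```python
from typing import Callable, List, Tuple


def optimize(
    seed_smiles: str,
    generate: Callable[[str], str],  # prompt -> SMILES string, e.g. a model call
    score: Callable[[str], float],   # SMILES -> objective, higher is better
    rounds: int = 10,
    pool_size: int = 5,
    similarity: float = 0.62,
) -> List[Tuple[float, str]]:
    """Greedy loop: prompt with the current best molecule, keep the top candidates."""
    pool = [(score(seed_smiles), seed_smiles)]
    for _ in range(rounds):
        best = pool[0][1]
        # The prompt evolves as the best molecule in the pool changes.
        prompt = f"</s>[SIMILAR]{similarity:.2f} {best}[/SIMILAR][START_SMILES]"
        candidate = generate(prompt)
        if candidate and all(candidate != smiles for _, smiles in pool):
            pool.append((score(candidate), candidate))
            pool.sort(key=lambda item: item[0], reverse=True)
            del pool[pool_size:]  # keep only the best `pool_size` molecules
    return pool
```

A real loop would also validate the generated SMILES and would likely mix property tags into the prompt. Both are left out to keep the sketch short.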
+ A preprint will be released soon with the details of the model and of an optimization algorithm built on top of it that sets a new state of the art on Practical Molecular Optimization
+ and other benchmarks.
+
+ A few notes:
+ * All queries should start with the `</s>` symbol.
+ * All numbers are rounded to two decimal places.
+ * All SMILES are canonicalized using `rdkit`.
+ * Available tags: `[CLOGP]`, `[WEIGHT]`, `[QED]`, `[SAS]`, `[TPSA]`, `[RINGCOUNT]`, `[SIMILAR]`...
+
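The notes above translate into a small prompt-building helper. This is a sketch under assumptions: the function names are invented for illustration, the tag order simply mirrors the two example prompts, every value is formatted with two decimals (whether integer-valued tags such as `[RINGCOUNT]` follow the same rule is not stated here), and SMILES are expected to be canonicalized beforehand.

```python
def property_prompt(smiles: str, tag: str) -> str:
    """Prompt asking the model to predict the property `tag` for a molecule."""
    # `smiles` is assumed to be canonicalized already, e.g. with rdkit's
    # Chem.MolToSmiles(Chem.MolFromSmiles(smiles)).
    return f"</s>[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(properties: dict, similar: tuple = ()) -> str:
    """Prompt asking for a molecule with the given properties and similarities.

    `properties` maps a tag name to its target value;
    `similar` holds (similarity, smiles) pairs.
    """
    parts = ["</s>"]  # every query starts with this symbol
    for tag, value in properties.items():
        parts.append(f"[{tag}]{value:.2f}[/{tag}]")  # two decimal places
    for sim, smiles in similar:
        parts.append(f"[SIMILAR]{sim:.2f} {smiles}[/SIMILAR]")
    parts.append("[START_SMILES]")
    return "".join(parts)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(property_prompt(aspirin, "SAS"))
print(generation_prompt({"SAS": 2.25}, [(0.62, aspirin)]))
```

The two `print` calls reproduce the example prompts from this README.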
+ The model is part of a three-model family: [Chemlactica-125M](https://huggingface.co/yerevann/chemlactica-125m),
+ [Chemlactica-1.3B](https://huggingface.co/yerevann/chemlactica-1.3b) and [Chemma-2B](https://huggingface.co/yerevann/chemma-2b).
+
+ We look forward to seeing the community use the model in new applications and contexts.