Update README.md
README.md
CHANGED
@@ -1,3 +1,36 @@
|
|
1 |
-
---
|
2 |
-
license:
|
3 |
-
1 |
+
---
|
2 |
+
license: cc-by-nc-4.0
|
3 |
+
language:
|
4 |
+
- en
|
5 |
+
library_name: transformers
|
6 |
+
tags:
|
7 |
+
- chemistry
|
8 |
+
- biology
|
9 |
+
---
|
10 |
+
Chemma-2B is a continually pretrained [gemma-2b](https://huggingface.co/google/gemma-2b) model for organic molecules.
|
11 |
+
It is pretrained on a (soon-to-be-released) corpus of 40B tokens covering 110M+ molecules from PubChem, as well as their chemical properties
|
12 |
+
(molecular weight, synthetic accessibility score, drug-likeness, etc.)
|
13 |
+
and similarities (Tanimoto distance between ECFP fingerprints).
|
14 |
+
|
15 |
+
Example prompts:
|
16 |
+
|
17 |
+
`</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]` will attempt to predict the synthetic accessibility score of the given molecule.
|
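A minimal sketch of how the property-prediction prompt above could be assembled and sent to the model. The helper names `property_prompt` and `complete` are hypothetical, not part of any released API. Greedy decoding, the 16-token budget, and the default tokenizer settings (including whether the tokenizer adds its own beginning-of-sequence token) are assumptions, so treat the decoded output as something to inspect rather than a guaranteed format.

```python
def property_prompt(smiles: str, tag: str = "SAS") -> str:
    """Build a prompt that ends with an open property tag for the model to complete."""
    return f"</s>[START_SMILES]{smiles}[END_SMILES][{tag}]"


def complete(prompt: str, max_new_tokens: int = 16) -> str:
    """Greedy-decode a continuation of `prompt` (loads the model on every call)."""
    # Imported lazily so that property_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "yerevann/chemma-2b"  # repository linked from this model card
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    # Assumption: greedy decoding with a small token budget is enough for one number.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0])


prompt = property_prompt("CC(=O)OC1=CC=CC=C1C(=O)O")
print(prompt)
# Calling complete(prompt) downloads the weights, so it is left to the reader.
```

Passing a different tag, for example `property_prompt(smiles, "QED")`, follows the same pattern for the other properties listed in the notes.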
18 |
+
|
19 |
+
`</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]` will attempt to generate a molecule that has a SAS score of 2.25 and
|
20 |
+
has a 0.62 similarity score to the given molecule.
|
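The conditional-generation prompt can be built the same way. The sketch below is one possible helper (`generation_prompt` is a hypothetical name) that reproduces the example string above. The ordering of property tags before `[SIMILAR]` blocks simply mirrors that example; whether the model accepts other orderings is not stated here.

```python
from collections.abc import Sequence


def generation_prompt(
    properties: dict[str, float],
    similar: Sequence[tuple[float, str]] = (),
) -> str:
    """Build a prompt that conditions generation on property values and similarities."""
    parts = ["</s>"]
    # Each target property is wrapped in its own tag pair, e.g. [SAS]2.25[/SAS].
    for tag, value in properties.items():
        parts.append(f"[{tag}]{value:.2f}[/{tag}]")
    # Each reference molecule is given as "<similarity> <SMILES>" inside [SIMILAR].
    for similarity, smiles in similar:
        parts.append(f"[SIMILAR]{similarity:.2f} {smiles}[/SIMILAR]")
    # The open [START_SMILES] tag asks the model to write the molecule.
    parts.append("[START_SMILES]")
    return "".join(parts)


gen_prompt = generation_prompt(
    {"SAS": 2.25},
    similar=[(0.62, "CC(=O)OC1=CC=CC=C1C(=O)O")],
)
print(gen_prompt)
```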
21 |
+
|
22 |
+
The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts.
|
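To make the idea of an optimization loop concrete, here is a toy sketch. It is not the algorithm from the upcoming preprint: the greedy keep-the-best strategy, the fixed 0.80 similarity target, and the names `optimize`, `toy_generate`, and `toy_score` are all assumptions made for illustration. The generator is a stand-in that returns random carbon chains; in practice it would sample completions from the model.

```python
import random
from typing import Callable


def optimize(
    seed: str,
    generate: Callable[[str, int], list[str]],
    score: Callable[[str], float],
    rounds: int = 5,
    samples_per_round: int = 4,
) -> tuple[str, float]:
    """Greedy loop: rebuild the prompt around the best molecule found so far."""
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        # Assumption: ask for molecules similar to the current best one.
        prompt = f"</s>[SIMILAR]0.80 {best}[/SIMILAR][START_SMILES]"
        for candidate in generate(prompt, samples_per_round):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


rng = random.Random(0)


def toy_generate(prompt: str, n: int) -> list[str]:
    """Stand-in for model sampling: ignores the prompt, returns random alkanes."""
    return ["C" * rng.randint(1, 8) for _ in range(n)]


def toy_score(smiles: str) -> float:
    """Toy objective: prefer chains of exactly five carbons."""
    return -abs(len(smiles) - 5)


best, best_score = optimize("C", toy_generate, toy_score)
print(best, best_score)
```

Swapping `toy_generate` for a function that samples from the model and `toy_score` for a real objective (an oracle or property predictor) keeps the loop unchanged.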
23 |
+
|
24 |
+
A preprint with the details of the model and an optimization algorithm built on top of this model that sets a new state of the art on Practical Molecular Optimization
|
25 |
+
and other benchmarks will be released soon.
|
26 |
+
|
27 |
+
A few notes:
|
28 |
+
* All queries should start with the `</s>` token.
|
29 |
+
* All numbers are rounded to two decimal places.
|
30 |
+
* All SMILES are canonicalized using `rdkit`.
|
31 |
+
* Available tags: `[CLOGP]`, `[WEIGHT]`, `[QED]`, `[SAS]`, `[TPSA]`, `[RINGCOUNT]`, `[SIMILAR]`...
|
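The formatting notes above can be applied before building a prompt, roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and the exact `MolToSmiles` options used for the training data are not given here. `kekuleSmiles=True` is a guess suggested by the Kekulé-style SMILES in the example prompts.

```python
def format_value(value: float) -> str:
    """Render a number with two decimal places, as the prompts expect."""
    return f"{value:.2f}"


def canonicalize(smiles: str) -> str:
    """Return an RDKit canonical SMILES string for `smiles`."""
    # Imported lazily so format_value works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # Assumption: Kekulé form, to match the style of the example prompts.
    return Chem.MolToSmiles(mol, canonical=True, kekuleSmiles=True)


print(format_value(2.2))
print(format_value(0.6249))
```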
32 |
+
|
33 |
+
The model is part of a three-model family: [Chemlactica-125M](https://huggingface.co/yerevann/chemlactica-125m),
|
34 |
+
[Chemlactica-1.3B](https://huggingface.co/yerevann/chemlactica-1.3b) and [Chemma-2B](https://huggingface.co/yerevann/chemma-2b).
|
35 |
+
|
36 |
+
We are looking forward to seeing the community use the model in new applications and contexts.
|