yerevann/chemma-2b

	---
	license: cc-by-nc-4.0
	language:
	- en
	library_name: transformers
	tags:
	- chemistry
	- biology
	---
	Chemma-2B is a continually pretrained [gemma-2b](https://huggingface.co/google/gemma-2b) model for organic molecules.
	It is pretrained on [40B tokens covering 110M+ molecules from PubChem](https://huggingface.co/datasets/yerevann/PubChemForLM) as well as their chemical properties
	(molecular weight, synthetic accessibility score, drug-likeness, etc.)
	and pairwise similarities (Tanimoto similarity between ECFP fingerprints).

	Example prompts:

	`</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]` will attempt to predict the synthetic accessibility score of the given molecule.

	`</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]` will attempt to generate a molecule that has an SAS score of 2.25 and
	a similarity score of 0.62 to the given molecule.
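
The two example prompts above can be assembled programmatically. The following is a minimal sketch: the helper names `property_prompt` and `generation_prompt` are hypothetical and not part of any released package, and the tag layout simply mirrors the two examples shown here.

```python
def property_prompt(smiles: str, tag: str = "SAS") -> str:
    """Build a prompt that asks the model to continue with the value of `tag`.

    `smiles` is assumed to be already canonicalized.
    """
    return f"</s>[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(sas: float, similarity: float, reference_smiles: str) -> str:
    """Build a prompt that asks for a molecule with a target SAS score and a
    target similarity to `reference_smiles`, following the example layout."""
    return (
        f"</s>[SAS]{sas:.2f}[/SAS]"
        f"[SIMILAR]{similarity:.2f} {reference_smiles}[/SIMILAR]"
        "[START_SMILES]"
    )


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
    print(property_prompt(aspirin))
    print(generation_prompt(2.25, 0.62, aspirin))
```

Running the script prints the two example prompts from this card verbatim.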

	The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts. See the [code on GitHub](https://github.com/YerevaNN/ChemLactica).
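
To illustrate what "evolving prompts" could look like, here is a deliberately simple greedy loop. It is a toy sketch, not the optimization algorithm from the linked repository or preprint. The `generate` and `score` callables are hypothetical placeholders: `generate` stands for any function that sends a prompt to the model and returns one generated SMILES string, and `score` is whatever objective you want to maximize.

```python
from typing import Callable, Tuple


def optimize(
    seed_smiles: str,
    generate: Callable[[str], str],  # hypothetical: prompt -> generated SMILES
    score: Callable[[str], float],   # hypothetical: SMILES -> objective (higher is better)
    steps: int = 10,
    target_sas: float = 2.25,
    similarity: float = 0.62,
) -> Tuple[str, float]:
    """Greedy hill climbing: each prompt is conditioned on the best molecule so far."""
    best, best_score = seed_smiles, score(seed_smiles)
    for _ in range(steps):
        # The prompt evolves because `best` changes whenever a candidate improves.
        prompt = (
            f"</s>[SAS]{target_sas:.2f}[/SAS]"
            f"[SIMILAR]{similarity:.2f} {best}[/SIMILAR]"
            "[START_SMILES]"
        )
        candidate = generate(prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

A real loop would also validate each candidate with `rdkit` and keep a pool of molecules rather than a single best one; see the repository for the actual implementation.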

	A preprint describing the model, together with an optimization algorithm built on top of it that sets a new state of the art on
	Practical Molecular Optimization and other benchmarks, is [available on arXiv](https://arxiv.org/abs/2407.18897).

	A few notes:
	* All queries should start with the `</s>` token.
	* All numbers are rounded to two decimal places.
	* All SMILES are canonicalized using `rdkit`.
	* Available tags: `[CLOGP]`, `[WEIGHT]`, `[QED]`, `[SAS]`, `[TPSA]`, `[RINGCOUNT]`, `[SIMILAR]`...
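
The notes above translate into a small amount of input normalization. The sketch below is one possible way to do it, under stated assumptions: the function names are hypothetical, `rdkit` is imported lazily so the other helpers work without it, and the fixed two-decimal formatting (which keeps trailing zeros) is an assumption, since the card does not say how values such as `0.5` are written.

```python
def canonicalize(smiles: str) -> str:
    """Return the RDKit canonical SMILES for `smiles` (requires `rdkit`)."""
    from rdkit import Chem  # lazy import: only needed for this helper

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)  # canonical output is RDKit's default


def format_number(value: float) -> str:
    """Format a property value with two decimals (assumed formatting)."""
    return f"{value:.2f}"


def ensure_prefix(query: str) -> str:
    """Make sure the query starts with `</s>`, as the notes require."""
    return query if query.startswith("</s>") else "</s>" + query


if __name__ == "__main__":
    print(format_number(2.254))
    print(ensure_prefix("[START_SMILES]CCO[END_SMILES][QED]"))
```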

	The model is part of a family of three models: [Chemlactica-125M](https://huggingface.co/yerevann/chemlactica-125m),
	[Chemlactica-1.3B](https://huggingface.co/yerevann/chemlactica-1.3b) and [Chemma-2B](https://huggingface.co/yerevann/chemma-2b).

	We look forward to seeing the community use the model in new applications and contexts.