zulfatmi committed on
Commit
9aec218
1 Parent(s): e9be25e

Update README.md

  1. README.md +86 -0
README.md CHANGED
@@ -1,3 +1,89 @@
  ---
  license: cc-by-nc-4.0
+ language:
+ - en
+ tags:
+ - chemistry
  ---
+
+ <h1 align="center"> nach0 </h1>
+ <h3 align="center"> Multimodal Natural and Chemical Languages Foundation Model </h3>
+ <p align="center">
+ 📃 <a href="https://arxiv.org/abs/2311.12410" target="_blank">Paper</a> • ⏬ <a href="https://huggingface.co/insilicomedicine/nach0_base" target="_blank">Base nach0</a> • ⏬ <a href="https://huggingface.co/insilicomedicine/nach0_base" target="_blank">Large nach0</a> <br>
+ </p>
+ <div align=center><img src="images/nach0_Pub_2.png" width="70%" height="70%" /></div>
+ <h2 id="1">Overview</h2>
+
+ - nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge.
+
+ - We employ instruction tuning, in which task-specific instructions are used to fine-tune nach0 on the final set of tasks. To train nach0 efficiently, we use the NeMo framework, which enables parallel optimization of both the base and large model versions.
+
+ - Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
+
+ <h2 id="1">Tasks</h2>
+ Datasets used for training and evaluation. Colour indicates the task type. Yellow and blue datasets are single-domain, typically requiring regression/classification losses or generation in the target domain (natural language or SMILES strings). Gradients from yellow to blue represent cross-domain generation tasks that require natural language input and SMILES output, or vice versa.
+ <div align=center><img src="images/nach0_Pub_1.png" width="70%" height="70%" /></div>
+
+ <h2> Model Usage Guide</h2>
+
+ To use the model for inference, follow the steps below:
+
+ 1. Preprocess the input by replacing the atom tokens with special tokens.
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ import re
+ from rdkit.Chem import MolFromSmiles
+ import string
+ from rdkit import RDLogger
+ # Silence RDKit parser warnings for words that are not valid SMILES.
+ RDLogger.DisableLog('rdApp.*')
+ atoms_tokens = ['Ag','Al','As','Au','B','Ba','Bi','Br','C','Ca',
+                 'Cd','Cl','Co','Cr','Cs','Cu','F','Fe','Ga','Gd',
+                 'Ge','H','Hg','I','In','K','Li','M','Mg','Mn',
+                 'Mo','N','Na','O','P','Pt','Ru','S','Sb','Sc',
+                 'Se','Si','Sn','V','W','Z','Zn','c','e','n','o','p','s']
+ # Longest symbols first, so that e.g. 'Cl' is matched before 'C'.
+ atoms_tokens = sorted(atoms_tokens, key=lambda s: len(s), reverse=True)
+ SMI_REGEX_PATTERN = r"(\[|\]|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9]|" + \
+     '|'.join(atoms_tokens) + ")"
+ regex = re.compile(SMI_REGEX_PATTERN)
+ def clean_output_sequence(output_sequence):
+     # Strip the end-of-sequence marker and the <sm_...> wrappers from the model output.
+     return output_sequence.replace('</s>', '').replace('<sm_', '').replace(' sm_', '').replace('>', '').strip()
+ def add_special_symbols(text):
+     output = []
+     for word in text.split():
+         tokens = [token for token in regex.findall(word)]
+         # Wrap a word only if it tokenizes completely, has more than 4 tokens, and RDKit parses it as SMILES.
+         if len(tokens) > 4 and (word == ''.join(tokens)) and MolFromSmiles(word):
+             output.append(''.join(['<sm_'+t+'>' for t in tokens]))
+         else:
+             output.append(word)
+     return ' '.join(output)
+ PROMPT = """Given the following reactants and reagents, please provide a possible product.
+ CCN(CC)CC.CCN=C=NCCCN(C)C.CN(C)C=O.Cl.NC1=CC=C(Cl)C=C1N.O.O=C(O)CCCCCNC(=O)C=C1C2=CC=CC=C2C2=CC=CC=C12.OC1=CC=CC2=C1N=NN2.[Cl-].[Na+]"""
+ PROMPT = add_special_symbols(PROMPT)
+ ```
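
To see what the preprocessing in step 1 produces without installing RDKit, the following minimal sketch applies the same word-level idea using only the standard library. It is an illustration under stated assumptions, not the repository's code: the atom list and symbol pattern are reduced, the RDKit `MolFromSmiles` validity check is omitted, and `add_special_symbols_sketch` and `min_tokens` are hypothetical names.

```python
import re

# Reduced atom vocabulary for illustration only (assumption); step 1 lists the full set.
ATOMS = sorted(["Cl", "Br", "C", "N", "O", "c", "n", "o"], key=len, reverse=True)
# Reduced symbol set (assumption); two-letter atoms are tried before one-letter atoms.
PATTERN = re.compile(r"(\[|\]|\(|\)|\.|=|#|-|\+|[0-9]|" + "|".join(ATOMS) + ")")

def add_special_symbols_sketch(text, min_tokens=5):
    """Wrap every token of SMILES-like words as <sm_TOKEN>; leave other words unchanged."""
    output = []
    for word in text.split():
        tokens = PATTERN.findall(word)
        # A word qualifies only if the tokens reproduce it exactly and it is long enough.
        if len(tokens) >= min_tokens and "".join(tokens) == word:
            output.append("".join(f"<sm_{t}>" for t in tokens))
        else:
            output.append(word)
    return " ".join(output)

print(add_special_symbols_sketch("Describe CC(=O)Cl now"))
# Describe <sm_C><sm_C><sm_(><sm_=><sm_O><sm_)><sm_Cl> now
```

Because this sketch skips the RDKit check, an ordinary word that happens to tokenize completely would also be wrapped; the `MolFromSmiles` test in step 1 guards against that case.
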
+ 2. Load the model checkpoint.
+
+ ```python
+ model = AutoModelForSeq2SeqLM.from_pretrained('insilicomedicine/nach0_base')
+ tokenizer = AutoTokenizer.from_pretrained('insilicomedicine/nach0_base')
+ ```
+
+ 3. Generate a response to the prompt and replace the special tokens with the corresponding atom tokens.
+ ```python
+ input_text_ids = tokenizer(PROMPT, padding="longest", max_length=512, truncation=True, return_tensors="pt")
+ generated_text_ids = model.generate(**input_text_ids, do_sample=True, top_k=100, top_p=0.95, max_length=512)
+ generated_text = tokenizer.batch_decode(generated_text_ids, skip_special_tokens=True)[0]
+ generated_text = clean_output_sequence(generated_text)
+ ```
+ ```python
+ # Example output (sampling is enabled, so results can differ between runs):
+ # NC1=CC=C(Cl)C=C1NC(=O)CCCCCNC(=O)C=C1C2=CC=CC=C2C2=CC=CC=C12
+ ```
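
The post-processing in step 3 removes every `>` character, including any that are not part of a `<sm_...>` wrapper. As a possible alternative, the sketch below removes only the wrappers with a regular expression. It is a hedged illustration, not the repository's method: `unwrap_smiles` is a hypothetical helper, and it assumes the decoded text contains tokens in the exact `<sm_X>` form (the ` sm_` variant handled by `clean_output_sequence` is not covered).

```python
import re

# Matches one wrapped token such as <sm_C> or <sm_Cl> and captures the inner symbol.
SM_TOKEN = re.compile(r"<sm_(.+?)>")

def unwrap_smiles(text):
    """Drop the end-of-sequence marker and unwrap <sm_X> tokens back to plain SMILES."""
    text = text.replace("</s>", "")
    return SM_TOKEN.sub(r"\1", text).strip()

print(unwrap_smiles("<sm_C><sm_C><sm_(><sm_=><sm_O><sm_)><sm_Cl></s>"))
# CC(=O)Cl
```

Text outside the wrappers, such as a standalone `>`, passes through unchanged.
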
+
+
+ <h3> References</h3>
+ If you use our repository, please cite the following related paper:
+
+ ```
+ @inproceedings{....
+ }
+ ```