---
library_name: transformers
tags: []
---

# MolXPT

Our model is a variant of GPT pre-trained on SMILES (a sequence representation of molecules) wrapped in text. It is based on [BioGPT](https://huggingface.co/microsoft/biogpt) with a redefined tokenizer, so the tokenizer must be loaded with `trust_remote_code=True`.

## Example Usage
```python
from transformers import AutoTokenizer, BioGptForCausalLM

# The tokenizer is redefined in this repo, so trust_remote_code=True is required.
model = BioGptForCausalLM.from_pretrained("zequnl/molxpt")
molxpt_tokenizer = AutoTokenizer.from_pretrained("zequnl/molxpt", trust_remote_code=True)

model = model.cuda()  # assumes a CUDA device; drop the .cuda() calls to run on CPU
model.eval()

# SMILES strings are wrapped in the <start-of-mol> ... <end-of-mol> special tokens.
# The prompt below asks the model to continue a sentence about aspirin's SMILES.
input_ids = molxpt_tokenizer('<start-of-mol>CC(=O)OC1=CC=CC=C1C(=O)O<end-of-mol> is ', return_tensors="pt").input_ids.cuda()
output = model.generate(
    input_ids,
    max_new_tokens=300,
    num_return_sequences=4,  # sample four continuations
    temperature=0.75,
    top_p=0.95,
    do_sample=True,
)

# Decode and print each sampled continuation.
for i in range(output.shape[0]):
    s = molxpt_tokenizer.decode(output[i])
    print(s)
```
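Because SMILES are delimited by the same special tokens during pre-training, generation can also be prompted in the other direction, from text to a molecule. The sketch below is a minimal illustration assuming the setup above; the prompt wording, generation settings, and use of `<end-of-mol>` as a stopping token are illustrative assumptions, not values prescribed by the paper.
```python
# Minimal text-to-molecule sketch (illustrative; prompt wording and generation
# settings are assumptions, not values from the paper). The model continues the
# text prompt with SMILES tokens after <start-of-mol>.
prompt = 'The molecule is a nonsteroidal anti-inflammatory drug. <start-of-mol>'
input_ids = molxpt_tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Stop decoding once the molecule is closed by <end-of-mol>.
end_of_mol_id = molxpt_tokenizer.convert_tokens_to_ids('<end-of-mol>')
output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.75,
    top_p=0.95,
    eos_token_id=end_of_mol_id,
)
print(molxpt_tokenizer.decode(output[0]))
```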
## References
For more information, please refer to our paper and GitHub repository.

Paper: [MolXPT: Wrapping Molecules with Text for Generative Pre-training](https://aclanthology.org/2023.acl-short.138/)

Authors: *Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang, Tie-Yan Liu*