---
tags:
  - safe
  - mamba
  - attention
  - hybrid
  - molecular-generation
  - smiles
  - generated_from_trainer
datasets:
  - katielink/moses
model-index:
  - name: HYBRID_20M
    results: []
---

# HYBRID_20M

HYBRID_20M is a model developed for molecular generation tasks, combining **Mamba** and **Attention** layers to draw on the strengths of each architecture. **The training code is available at [https://github.com/Anri-Lombard/Mamba-SAFE](https://github.com/Anri-Lombard/Mamba-SAFE).** The model was trained from scratch on the [MOSES](https://huggingface.co/datasets/katielink/moses) dataset, converted from SMILES to the SAFE (Sequential Attachment-based Fragment Embedding) format to improve molecular representation for machine learning applications. HYBRID_20M performs comparably to transformer-based models such as [SAFE_20M](https://huggingface.co/anrilombard/safe-20m) and Mamba-based models such as [SSM_20M](https://huggingface.co/anrilombard/ssm-20m).

## Evaluation Results

HYBRID_20M performs on par with both transformer-based and Mamba-based baselines on molecular generation tasks, producing molecules with high validity and diversity. This indicates that combining Mamba's sequence modeling with Attention mechanisms is effective for this domain.

## Model Description

HYBRID_20M employs a hybrid architecture that integrates the **Mamba** framework with **Attention** layers. This integration allows the model to benefit from Mamba's efficient sequence modeling capabilities and the contextual understanding provided by Attention mechanisms.
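The exact interleaving of Mamba and Attention blocks is not specified in this card. As a purely illustrative sketch, a hybrid stack is often laid out by placing an Attention block every N-th layer with Mamba blocks elsewhere; the layer count and `attn_every` ratio below are assumptions, not HYBRID_20M's actual configuration:

```python
# Illustrative sketch only: the real HYBRID_20M layer layout is not
# documented here. "attn_every" (an Attention block every N-th layer,
# Mamba blocks elsewhere) is an assumed interleaving pattern.
def hybrid_layer_schedule(n_layers: int, attn_every: int) -> list[str]:
    """Return a layer-type list interleaving Mamba and Attention blocks."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba"
        for i in range(n_layers)
    ]

schedule = hybrid_layer_schedule(n_layers=12, attn_every=4)
print(schedule)
```

With these hypothetical settings, layers 4, 8, and 12 are Attention blocks and the rest are Mamba blocks, so the cheap state-space layers do most of the sequence mixing while periodic Attention layers restore global token-to-token interaction.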

### Mamba Framework

The Mamba framework, utilized in HYBRID_20M, was introduced in the following publication:

```bibtex
@article{gu2023mamba,
  title={Mamba: Linear-time sequence modeling with selective state spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
```

We acknowledge the authors for their contributions to sequence modeling.

### Attention Mechanisms

Attention layers enhance the model's ability to focus on relevant parts of the input sequence, facilitating the capture of long-range dependencies and contextual information. This capability is essential for accurately generating complex molecular structures.

### SAFE Framework

The SAFE framework, also employed in HYBRID_20M, was introduced in the following publication:

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```

We acknowledge the authors for their contributions to molecular design.

## Intended Uses & Limitations

### Intended Uses

HYBRID_20M is intended for:

- **Generating Molecular Structures:** Creating novel molecules with desired properties.
- **Exploring Chemical Space:** Investigating the vast array of possible chemical compounds for research and development.
- **Assisting in Material Design:** Facilitating the creation of new materials with specific functionalities.

### Limitations

- **Validation Required:** Outputs should be validated by domain experts before practical application.
- **Synthetic Feasibility:** Generated molecules may not always be synthetically feasible.
- **Dataset Scope:** The model's knowledge is limited to the chemical space represented in the MOSES dataset.

## Training and Evaluation Data

The model was trained on the [MOSES (MOlecular SEtS)](https://huggingface.co/datasets/katielink/moses) dataset, a benchmark dataset for molecular generation. The dataset was converted from SMILES to the SAFE format to enhance molecular representation for machine learning tasks.

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:

- **Learning Rate:** 0.0005
- **Training Batch Size:** 32
- **Evaluation Batch Size:** 32
- **Seed:** 42
- **Gradient Accumulation Steps:** 2
- **Total Training Batch Size:** 64
- **Optimizer:** Adam (betas=(0.9, 0.999), epsilon=1e-08)
- **Learning Rate Scheduler:** Linear with 20,000 warmup steps
- **Number of Epochs:** 10
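The effective batch size (32 per step × 2 gradient-accumulation steps = 64) and the linear-warmup schedule above can be sketched as follows. The total step count and the post-warmup linear decay to zero are assumptions about the trainer's defaults, not values stated in this card:

```python
# Sketch of the schedule implied above. total_steps is hypothetical;
# the card only specifies the peak LR (5e-4) and warmup length (20k).
def linear_warmup_lr(step: int, peak_lr: float = 5e-4,
                     warmup_steps: int = 20_000,
                     total_steps: int = 100_000) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

# Effective batch size: per-device batch x gradient accumulation steps.
effective_batch = 32 * 2
print(effective_batch)              # 64
print(linear_warmup_lr(10_000))     # halfway through warmup: 2.5e-4
```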

### Framework Versions

- **Mamba:** [Specify version]
- **PyTorch:** [Specify version]
- **Datasets:** 2.20.0
- **Tokenizers:** 0.19.1

## Acknowledgements

We acknowledge the authors of the [Mamba](https://github.com/state-spaces/mamba) and [SAFE](https://github.com/datamol-io/safe) frameworks for their contributions to sequence modeling and molecular design.

## References

```bibtex
@article{gu2023mamba,
  title={Mamba: Linear-time sequence modeling with selective state spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```