library_name: transformers
tags:
- chemistry
- biology
- SELFIES
- life-sciences
license: mit
datasets:
- mikemayuare/PubChem10M_SMILES_SELFIES
Model Card for SELFYBPE
A RoBERTa-based model pretrained with masked language modeling (MLM) on SELFIES strings, ready to fine-tune on specific tasks.
Model Details
Model Description
A RoBERTa-based model pretrained with masked language modeling (MLM) on 2 million Self-Referencing Embedded Strings (SELFIES), tokenized with Byte Pair Encoding (BPE).
- Developed by: Miguelangel Leon Mayuare
- Funded by: This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
- Shared by: Miguelangel Leon Mayuare
- Model type: RoBERTa-based
- Language(s) (NLP): SELFIES
- License: MIT
Model Sources
- Paper: On review
Uses
The model's intended use is fine-tuning on downstream tasks where SELFIES is the main input.
Direct Use
The model can be directly used for the classification of chemical compounds and prediction of molecular properties using SELFIES representations.
Downstream Use
The model can be fine-tuned for specific tasks such as drug discovery, toxicity prediction, and other cheminformatics applications using specific datasets.
Out-of-Scope Use
The model should not be used for tasks outside of cheminformatics or without proper validation for the specific task. Misuse includes using the model to generate invalid chemical compounds or to make predictions outside the domain of the training data. The model works only with SELFIES input; for SMILES models, search the mikemayuare repository.
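Because the model expects SELFIES rather than SMILES, a cheap syntactic check before tokenization can catch the most common input mistake. The sketch below is only an illustration: `looks_like_selfies` is a hypothetical helper, not part of this model or of the selfies library, and it checks bracket syntax only, not chemical validity.

```python
import re

# A SELFIES string is a sequence of bracketed symbols such as [C] or [=O];
# a "." may separate disconnected fragments.
_SELFIES_PATTERN = re.compile(r"(?:\[[^\[\]]+\]|\.)+")


def looks_like_selfies(text: str) -> bool:
    """Return True if `text` consists only of bracketed symbols (and dots)."""
    # Require at least one bracketed symbol so that a lone "." is rejected.
    return "[" in text and bool(_SELFIES_PATTERN.fullmatch(text))


if __name__ == "__main__":
    print(looks_like_selfies("[C][C][O]"))  # SELFIES-style input
    print(looks_like_selfies("CCO"))        # SMILES-style input
```

A SMILES string made entirely of bracket atoms would still pass this check, so it is a first filter rather than a guarantee.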
Bias, Risks, and Limitations
The model may inherit biases from the training data. Limitations include potential overfitting to the pre-training tasks and resource intensity for training and fine-tuning.
Recommendations
The model was pretrained on 2 million SELFIES in order to mitigate misrepresentation (over- and under-representation) of any type of molecule. Validating on known datasets for downstream tasks is the best way to reveal its limitations.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoModel, AutoTokenizer
# Load the BPE tokenizer and the pretrained RoBERTa encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("mikemayuare/SELFYBPE")
model = AutoModel.from_pretrained("mikemayuare/SELFYBPE")
Training Details
Training Data
The training data comprised 2 million molecules from the PubChem dataset. SMILES strings were converted to SELFIES using the selfies library.
Training Procedure
The models were pre-trained for 20 epochs using the AdamW optimizer on an NVIDIA 3060 GPU with 12GiB of VRAM.
Preprocessing
SMILES strings were converted to SELFIES using the selfies library, and tokenizers were trained on a subset of 1 million molecules from the PubChem dataset.
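The conversion step described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing script: the function name `smiles_to_selfies`, the optional `encoder` argument, and the choice to drop molecules that fail to encode are all assumptions. By default it calls `selfies.encoder`, the conversion function provided by the selfies library.

```python
from typing import Callable, List, Optional


def smiles_to_selfies(
    smiles_list: List[str],
    encoder: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Convert SMILES strings to SELFIES, skipping any that fail to encode."""
    if encoder is None:
        # Imported lazily so the function can also run with a stand-in encoder.
        import selfies as sf

        encoder = sf.encoder

    converted = []
    for smiles in smiles_list:
        try:
            converted.append(encoder(smiles))
        except Exception:
            # selfies raises an EncoderError for SMILES it cannot handle;
            # silently dropping such molecules is an assumption of this sketch.
            continue
    return converted
```

With the selfies package installed, `smiles_to_selfies(["CCO", "c1ccccc1"])` uses the real encoder; passing a custom `encoder` is only useful for experimenting without the dependency.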
Training Hyperparameters
- Training regime: fp32
- Batch size: 32
- Number of epochs: 20
- Optimizer: AdamW
Speeds, Sizes, Times
Training time was approximately 72 hours on the specified hardware. Checkpoint sizes are approximately 500MB each.
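For a rough sense of scale, the reported figures can be turned into an optimizer step count and an average throughput. The arithmetic below is a back-of-the-envelope sketch that assumes all 2 million molecules were used for training (no held-out split) and that no gradient accumulation was applied; neither assumption is confirmed by this card.

```python
import math

# Figures reported in this card.
NUM_MOLECULES = 2_000_000
BATCH_SIZE = 32
NUM_EPOCHS = 20
TRAINING_HOURS = 72


def optimizer_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Total optimizer steps, keeping the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


total_steps = optimizer_steps(NUM_MOLECULES, BATCH_SIZE, NUM_EPOCHS)
steps_per_second = total_steps / (TRAINING_HOURS * 3600)

print(f"total optimizer steps: {total_steps:,}")
print(f"average throughput: {steps_per_second:.2f} steps/s")
```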
Evaluation
Testing Data
Testing was conducted on MoleculeNet datasets, specifically BBBP, HIV, and Tox21.
Factors
Evaluation metrics were disaggregated by dataset and task type (e.g., binary classification for BBBP).
Metrics
The primary evaluation metric, computed on the fine-tuned models, was the ROC-AUC score, which is commonly used for binary classification tasks in cheminformatics.
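For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python sketch below illustrates that definition; it is not the evaluation code used for this model, and the function name `roc_auc` is chosen here for illustration. In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Three of the four positive/negative pairs are ranked correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration but slower than rank-based implementations on large test sets.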
Results
The models tokenized with APE generally outperformed those tokenized with BPE. SMILES models showed better performance than SELFIES models in most cases.
Summary
The model achieved competitive performance on standard benchmarks, outperforming several baseline models in specific tasks.
Model Examination
Interpretability analyses showed that models tokenized with APE preserved the chemical context better than those with BPE, leading to higher classification accuracy.
Environmental Impact
Carbon emissions were estimated using the Machine Learning Impact calculator.
- Hardware Type: NVIDIA 3060 GPU
- Hours used: 72 hours
- Cloud Provider: Not applicable
- Compute Region: Local
- Carbon Emitted: Approximately 50 kg CO2eq
Technical Specifications
Model Architecture and Objective
The model architecture is based on RoBERTa with 6 hidden layers, 768 hidden size, 1536 intermediate size, and 12 attention heads.
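The stated dimensions allow a rough estimate of the encoder size. The sketch below assumes the standard RoBERTa/BERT layer layout (query, key, value and output projections with biases, a two-layer feed-forward block, and two LayerNorms per layer). It deliberately excludes the embedding tables and the MLM head, because the vocabulary size and maximum sequence length are not stated in this card.

```python
# Dimensions reported in this card.
NUM_LAYERS = 6
HIDDEN_SIZE = 768
INTERMEDIATE_SIZE = 1536
NUM_HEADS = 12


def encoder_parameter_count(num_layers: int, hidden: int, intermediate: int) -> int:
    """Parameters in the transformer layers only (no embeddings, no MLM head)."""
    # Query, key, value and attention-output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * (2 * hidden)
    return num_layers * (attention + feed_forward + layer_norms)


head_dim = HIDDEN_SIZE // NUM_HEADS
layer_params = encoder_parameter_count(NUM_LAYERS, HIDDEN_SIZE, INTERMEDIATE_SIZE)

print(f"dimension per attention head: {head_dim}")
print(f"parameters in the encoder layers: {layer_params:,}")
```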
Compute Infrastructure
Hardware
- Type: NVIDIA 3060 GPU
- VRAM: 12GiB
Software
- Framework: PyTorch
- Libraries: transformers, selfies, DeepChem, Optuna
Citation
BibTeX:
@mastersthesis{leon2024chemical,
title={Chemical Language Modeling},
author={Miguelangel Augusto Leon Mayuare},
year={2024},
school={NOVA Information Management School}
}
APA:
Mayuare, M. A. L. (2024). Chemical Language Modeling (Master's thesis). NOVA Information Management School.
Glossary
- SELFIES: A string-based representation of molecules.
- SMILES: Simplified Molecular Input Line Entry System, a notation for describing the structure of chemical species.
More Information
For more details, refer to the associated paper (pending publication).
Model Card Authors
- Miguelangel Augusto Leon Mayuare
Model Card Contact
For inquiries, please contact [email protected]