|
--- |
|
license: apache-2.0 |
|
library_name: transformers |
|
datasets: |
|
- vector-institute/s2ef-15m |
|
metrics: |
|
- mae |
|
pipeline_tag: graph-ml |
|
--- |
|
|
|
# AtomFormer base model |
|
|
|
AtomFormer is a transformer-based model that leverages Gaussian pairwise positional embeddings to train on atomistic graph data. It is part of a suite of datasets, models, and utilities in the AtomGen project, which supports multiple methods for pre-training and fine-tuning models on atomistic graphs.
|
|
|
|
|
## Model description |
|
|
|
AtomFormer is a transformer model with modifications for training on atomistic graphs. It builds primarily on the work of Uni-Mol+, adding Gaussian pairwise positional embeddings to the attention bias to leverage 3D positional information. The model was pre-trained on a diverse set of aggregated atomistic datasets, with per-atom force prediction and per-system energy prediction as the target tasks.
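
As a rough illustration of the mechanism (a minimal sketch under stated assumptions, not the actual AtomFormer code; the class name, kernel count, and distance cutoff below are made up for the example), pairwise distances can be expanded in a Gaussian basis and projected to a per-head bias that is added to the attention logits:

```python
import torch
import torch.nn as nn

class GaussianPairwiseBias(nn.Module):
    """Expand pairwise distances in a Gaussian basis and project to per-head attention biases."""

    def __init__(self, num_heads: int, num_kernels: int = 128, max_dist: float = 10.0):
        super().__init__()
        # Fixed Gaussian centers spread over the expected interatomic distance range.
        self.register_buffer("centers", torch.linspace(0.0, max_dist, num_kernels))
        self.width = max_dist / num_kernels
        self.proj = nn.Linear(num_kernels, num_heads)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (batch, num_atoms, 3)
        dist = torch.cdist(coords, coords)  # (B, N, N) pairwise distances
        rbf = torch.exp(-((dist.unsqueeze(-1) - self.centers) / self.width) ** 2)  # (B, N, N, K)
        bias = self.proj(rbf)  # (B, N, N, num_heads)
        return bias.permute(0, 3, 1, 2)  # (B, num_heads, N, N), added to attention logits

bias = GaussianPairwiseBias(num_heads=12)(torch.randn(1, 10, 3))
bias.shape  # torch.Size([1, 12, 10, 10])
```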
|
|
|
The model also incorporates metadata about the atomic species being modeled, such as atomic radius, electronegativity, and valency. This metadata is normalized, projected, and added to the atom embeddings in the model.
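
For intuition only (the feature table, dimensions, and layer names here are illustrative, not the model's real metadata or weights), the normalize-and-project step could look like this:

```python
import torch
import torch.nn as nn

num_species, num_features, hidden_dim = 119, 3, 768  # 118 elements plus a padding index

# Illustrative per-species feature table: [atomic radius, electronegativity, valency].
# In practice these would be real tabulated values, not random numbers.
metadata = torch.randn(num_species, num_features)

# Normalize each feature to zero mean / unit variance, then project to the hidden size.
metadata = (metadata - metadata.mean(0)) / metadata.std(0)
meta_proj = nn.Linear(num_features, hidden_dim)
atom_embed = nn.Embedding(num_species, hidden_dim)

input_ids = torch.randint(1, num_species, (1, 10))
hidden = atom_embed(input_ids) + meta_proj(metadata[input_ids])  # (1, 10, 768)
```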
|
|
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for force and energy prediction, but it is mostly intended to be fine-tuned on a downstream task. The model's performance as a force and energy predictor has not been validated; force and energy prediction was used primarily as a pre-training task.
|
|
|
|
|
### How to use |
|
|
|
Here is how to use the model to extract features from the pre-trained backbone: |
|
|
|
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "vector-institute/atomformer-base", trust_remote_code=True
)

input_ids = torch.randint(0, 50, (1, 10))  # tokenized atomic species
coords = torch.randn(1, 10, 3)             # 3D coordinates per atom (x, y, z)
attention_mask = torch.ones(1, 10)         # 1 = real atom, 0 = padding

output = model(input_ids, coords=coords, attention_mask=attention_mask)
output.shape  # torch.Size([1, 10, 768])
```
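
The output is a batch of per-atom representations with hidden size 768; these features can be pooled over atoms or passed to a task-specific head for fine-tuning.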
|
|
|
|
|
## Training data |
|
|
|
AtomFormer is trained on an aggregated S2EF (structure to energy and forces) dataset compiled from multiple sources, including OC20, OC22, ODAC23, MPtrj, and SPICE, with structures and energies/forces for pre-training. The pre-training data includes both total energies and formation energies, but training uses formation energy, which is not available for OC22 (indicated by the `has_formation_energy` column).
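
For example, the aggregated dataset can be loaded with the `datasets` library and filtered on that column. This is a sketch; the split name and streaming usage are assumptions rather than documented behavior, so check the dataset card:

```python
from datasets import load_dataset

# Stream the pre-training corpus and keep only rows with a formation-energy label.
ds = load_dataset("vector-institute/s2ef-15m", split="train", streaming=True)
ds = ds.filter(lambda example: example["has_formation_energy"])

sample = next(iter(ds))
print(sample.keys())
```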
|
|
|
|
|
### Preprocessing |
|
|
|
The model expects input in the form of tokenized atomic symbols, represented as `input_ids`, and 3D coordinates, represented as `coords`. For the pre-training task it also expects `forces` and `formation_energy` labels.
|
|
|
The `DataCollatorForAtomModeling` utility in the AtomGen library can perform dynamic padding to batch the data together. It also offers the option to flatten the data and provide a `batch` column for GNN-style training.
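
To illustrate what dynamic padding means for this input format, here is a plain-PyTorch sketch of the idea (not the actual `DataCollatorForAtomModeling` implementation; the function name and padding values are made up for the example):

```python
import torch

def pad_batch(examples, pad_token_id=0):
    """Pad variable-length atomistic examples to the longest system in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "coords": [], "attention_mask": []}
    for ex in examples:
        n = len(ex["input_ids"])
        pad = max_len - n
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad)
        batch["coords"].append(ex["coords"] + [[0.0, 0.0, 0.0]] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
    return {k: torch.tensor(v) for k, v in batch.items()}

examples = [
    {"input_ids": [8, 1, 1], "coords": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]},
    {"input_ids": [6, 8], "coords": [[0.0, 0.0, 0.0], [1.16, 0.0, 0.0]]},
]
batch = pad_batch(examples)
batch["input_ids"].shape  # torch.Size([2, 3])
```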
|
|
|
|
|
### Pretraining |
|
|
|
The model was trained on a single node with 4x A40 (48 GB) GPUs for 10 epochs (~2 weeks). See the [training code](https://github.com/VectorInstitute/AtomGen) for full hyperparameter details.
|
|
|
## Evaluation results |
|
|
|
We use the Atom3D dataset to evaluate the model's performance on downstream tasks. |
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
|
|
| Task | SMP | PIP | RES | MSP | LBA | LEP | PSR | RSR |
|:------:|:-----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| Result | 1.077 | TBD | TBD | TBD | TBD | TBD | TBD | TBD |