---
license: apache-2.0
library_name: transformers
datasets:
- vector-institute/s2ef-15m
metrics:
- mae
pipeline_tag: graph-ml
---

# AtomFormer base model

This model is a transformer-based model that uses Gaussian pair-wise positional embeddings to train on atomistic graph data. It is part of a suite of datasets, models, and utilities in the AtomGen project, which supports other methods for pre-training and fine-tuning models on atomistic graphs.

## Model description

AtomFormer is a transformer model with modifications to train on atomistic graphs. It builds primarily on the work from Uni-Mol+ to add the pair-wise positional embeddings to the attention mask, which lets the model use 3D positional information. This model was pre-trained on a diverse set of aggregated atomistic datasets where the target tasks are per-atom force prediction and per-system energy prediction.

The model also includes metadata about the atomic species being modeled, including the atomic radius, electronegativity, valency, etc. The metadata is normalized and projected, then added to the atom embeddings in the model.

## Intended uses & limitations

You can use the raw model for force and energy prediction, but it is mostly intended to be fine-tuned on a downstream task. The performance of the model as a force and energy prediction model has not been validated; force and energy prediction was primarily used as a pre-training task.
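
The pair-wise positional bias described under Model description can be sketched roughly as follows. This is a minimal NumPy illustration, not AtomFormer's actual implementation: the function names `gaussian_pair_bias` and `attention_with_bias`, the number of kernels, the distance cutoff, and the random projection weights (standing in for learned parameters) are all assumptions made for the example.

```python
import numpy as np


def gaussian_pair_bias(coords, num_kernels=8, max_dist=8.0, seed=0):
    """Return an additive attention bias of shape (n_atoms, n_atoms).

    Hypothetical helper: kernel count, cutoff, and weights are illustrative.
    """
    # Pairwise Euclidean distances between all atoms: (n, n)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Expand each distance in a bank of Gaussian kernels: (n, n, K)
    centers = np.linspace(0.0, max_dist, num_kernels)
    width = centers[1] - centers[0]
    feats = np.exp(-0.5 * ((dist[..., None] - centers) / width) ** 2)

    # Project the K kernel responses to one scalar per atom pair.
    # Random weights stand in for parameters that would be learned.
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=num_kernels)
    return feats @ weights


def attention_with_bias(q, k, v, bias):
    """Single-head scaled dot-product attention with an additive pair bias."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v


rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))   # 5 atoms with 3D positions
x = rng.normal(size=(5, 16))       # toy atom embeddings
bias = gaussian_pair_bias(coords)
out = attention_with_bias(x, x, x, bias)
```

Because the bias depends only on inter-atomic distances, it is unchanged when the whole structure is rotated or translated.
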
### How to use

Here is how to use the model to extract features from the pre-trained backbone:

```python
import torch
from transformers import AutoModel

# trust_remote_code=True loads the custom model code shipped with the repository
model = AutoModel.from_pretrained("vector-institute/atomformer-base", trust_remote_code=True)

input_ids = torch.randint(0, 50, (1, 10))  # random ids standing in for tokenized atomic symbols
coords = torch.randn(1, 10, 3)             # 3D coordinates for each atom
attention_mask = torch.ones(1, 10)

output = model(input_ids, coords=coords, attention_mask=attention_mask)
output.shape  # torch.Size([1, 10, 768])
```

## Training data

AtomFormer is trained on an aggregated S2EF dataset from multiple sources such as OC20, OC22, ODAC23, MPtrj, and SPICE, with structures and energies/forces for pre-training. The pre-training data includes total energies and formation energies, but training uses the formation energy (which is not available for OC22, as indicated by the `has_formation_energy` column).

### Preprocessing

The model expects input in the form of tokenized atomic symbols represented as `input_ids` and 3D coordinates represented as `coords`. For the pre-training task it also expects labels for the `forces` and `formation_energy`.

The `DataCollatorForAtomModeling` utility in the AtomGen library can perform dynamic padding to batch the data together. It also offers the option to flatten the data and provide a `batch` column for GNN-style training.

### Pretraining

The model was trained on a node with 4x A40 (48 GB) GPUs for 10 epochs (~2 weeks). See the [training code](https://github.com/VectorInstitute/AtomGen) for all hyperparameter details.

## Evaluation results

We use the Atom3D dataset to evaluate the model's performance on downstream tasks.

When fine-tuned on downstream tasks, this model achieves the following results:

| Task | SMP | PIP | RES | MSP | LBA | LEP | PSR | RSR |
|:----:|:-----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|      | 1.077 | TBD | TBD | TBD | TBD | TBD | TBD | TBD |