# **ESM-Scan**
Calculate the <u>fitness of single amino acid substitutions</u> on proteins, using a [zero-shot](https://doi.org/10.1101/2021.07.09.450648) [language model predictor](https://github.com/facebookresearch/esm)
<details>
<summary> <b> USAGE INSTRUCTIONS </b> </summary>
### **Setup**
No setup is required: fill the input boxes with the required data and click the `Run` button.
A list of examples can be found at the bottom of the page; click on one to autofill the fields.
If the server is not used for some time, it goes into standby.
Running a calculation resumes the tool from standby, so the first run might take longer because of startup and model loading.
### **Input**
- write the full amino acid sequence of the protein to be analysed in the **Sequence** text box;
  wildcard characters (e.g. `-X.B`) can be inserted but, at the moment, the visualisation cannot handle them
- write the substitutions to test in the **Substitutions** box;
  there are four running modes, selected according to the input:
  + *single substitution* or list thereof (in the form `R218K R218W`): each listed substitution is scored
  + *residue position* or list thereof: all possible substitutions at each position are evaluated
  + *same-length sequence*: the differing amino acid substitutions are evaluated, one by one
  + any other *different input*: a deep mutational scan of the full sequence is performed
- the ESM model to use for the calculations can be chosen among those available on the Hugging Face Model Hub;
  `esm2_t33_650M_UR50D` offers the best cost-accuracy tradeoff[*](https://doi.org/10.1126/science.ade2574)
- the `masked-marginals` scoring strategy considers sequence context at inference time, which makes it slower but more accurate;
  in case of long runtimes, you can untick the box to speed up the calculations significantly, at the cost of accuracy
- when running a deep mutational scan, smaller models (8M, 35M, 150M parameters) are recommended, since the runtime is significant, especially for longer sequences, and the server might be overloaded;
  a 300-residue sequence can take over 30 min with larger models;
  in general, accuracy is influenced significantly by the scoring strategy and less so by the model size, so it is suggested to reduce the model size first when optimising for runtime;
  the computational cost of the scoring strategy scales with the number of substitutions tested, while that of the model scales with the wild-type sequence length
- the effect of multiple concurrent substitutions can be calculated, but only manually, by changing the input sequence and running the calculation again
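The input conventions above can be sketched in a few lines of Python. This is only an illustration, not the parser used by ESM-Scan: `classify_input` and `apply_substitutions` are hypothetical helpers, the toy sequence is made up, and positions are assumed to be 1-based, as in the usual `R218K` notation. The second helper builds the edited sequence needed to test concurrent substitutions manually.

```python
import re

# Hypothetical helpers for illustration; not part of ESM-Scan.
# Assumption: a substitution is <wild-type><1-based position><mutant>, e.g. R218K.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def classify_input(sequence: str, substitutions: str) -> str:
    """Guess which running mode a Substitutions entry would select."""
    tokens = substitutions.split()
    if tokens and all(SUBSTITUTION.match(t) for t in tokens):
        return "single substitutions"
    if tokens and all(t.isdigit() for t in tokens):
        return "residue positions"
    if len(tokens) == 1 and tokens[0].isalpha() and len(tokens[0]) == len(sequence):
        return "same-length sequence"
    # Anything else falls back to the full-sequence scan.
    return "deep mutational scan"


def apply_substitutions(sequence: str, substitutions: str) -> str:
    """Return the sequence carrying all listed substitutions at once."""
    residues = list(sequence)
    for token in substitutions.split():
        match = SUBSTITUTION.match(token)
        if match is None:
            raise ValueError(f"not a substitution: {token!r}")
        wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
        index = position - 1  # convert the 1-based position to a 0-based index
        if not 0 <= index < len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if sequence[index] != wild_type:
            raise ValueError(
                f"{token}: expected {wild_type} at position {position}, found {sequence[index]}"
            )
        residues[index] = mutant
    return "".join(residues)


toy = "MKTAYIAK"  # made-up sequence, for illustration only
print(classify_input(toy, "K2R A4G"))       # single substitutions
print(apply_substitutions(toy, "K2R A4G"))  # MRTGYIAK
```

The string returned by `apply_substitutions` can be pasted into the **Sequence** box to run the calculation again on the edited protein.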
### **Output**
Results are shown in a color-coded table, except for the deep mutational scan, which yields a heatmap.
The output data can be downloaded from the box at the bottom.
File extensions are not supported by the server and need to be appended to the filenames after downloading:
- `CSV` for tables
- `SVG` for the full-sequence deep mutational scan
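If many results are downloaded, appending the extensions can be scripted. The snippet below is a minimal sketch using only the Python standard library; `append_extension` is a hypothetical helper and the file name in the usage comment is a placeholder, since the names assigned by the server are not specified here.

```python
from pathlib import Path


def append_extension(path, extension):
    """Rename a downloaded file by appending the given extension to its name."""
    source = Path(path)
    suffix = "." + extension.lower().lstrip(".")
    target = source.with_name(source.name + suffix)
    source.rename(target)
    return target


# Usage (placeholder file name, adjust to the actual download):
# append_extension("Downloads/esm_scan_output", "CSV")  # table result
# append_extension("Downloads/esm_scan_output", "SVG")  # deep mutational scan heatmap
```

Which extension applies depends on the run: tables are CSV, while the full-sequence deep mutational scan is SVG.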
</details>