File size: 3,395 Bytes
7a49887 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 |
---
license: other
pipeline_tag: token-classification
tags:
- biology
- RNA
- Torsional
- Angles
---
# `RNA-TorsionBERT`
## Model Description
`RNA-TorsionBERT` is a 331 MB parameter BERT-based language model that predicts RNA torsional and pseudo-torsional angles from the sequence.
`RNA-TorsionBERT` is a DNABERT model that was pre-trained on ~4200 RNA structures before being fine-tuned on 185 non-redundant structures.
It provides an improvement of MAE of 6.2° over the previous state-of-the-art model, SPOT-RNA-1D, on the Test Set (composed of RNA-Puzzles and CASP-RNA).
| Model | alpha | beta | gamma | delta | epsilon | zeta | chi | eta | theta |
|------------------|----------|-------|-------|-------|---------|-------|-------|-------|-------|
| **RNA-TorsionBERT** | 37.3 | 19.6 | 29.4 | 13.6 | 16.6 | 26.6 | 14.7 | 20.1 | 25.4 |
| SPOT-RNA-1D | 45.7 | 23 | 33.6 | 19 | 21.1 | 34.4 | 19.3 | 28.9 | 33.9 |
**Key Features**
* Torsional and Pseudo-torsional angles prediction
* Predict sequences up to 512 nucleotides
## Usage
Get started generating text with `RNA-TorsionBERT` by using the following code snippet:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
sequence = "ACG CGG GGT GTT"
params_tokenizer = {
"return_tensors": "pt",
"padding": "max_length",
"max_length": 512,
"truncation": True,
}
inputs = tokenizer(sequence, **params_tokenizer)
output = model(inputs)["logits"]
```
- Please note that it was fine-tuned from a DNABERT-3 model and therefore the tokenizer is the same as the one used for DNABERT. Nucleotide `U` should therefore be replaced by `T` in the input sequence.
- The output is the sinus and the cosine for each angle. The angles are in the following order: `alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`, `chi`, `eta`, `theta`.
To convert the predictions into angles, you can use the following code snippet:
```python
from typing import Optional
import numpy as np
ANGLES_ORDER = [ "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta" ]
def convert_sin_cos_to_angles(output: np.ndarray, input_ids: Optional[np.ndarray] = None):
"""
Convert the raw predictions of the RNA-TorsionBERT into angles.
It converts the cos and sinus into angles using:
alpha = arctan(sin(alpha)/cos(alpha))
:param output: Dictionary with the predictions of the RNA-TorsionBERT per angle
:param input_ids: the input_ids of the RNA-TorsionBERT. It allows to only select the of the sequence,
and not the special tokens.
:return: a np.ndarray with the angles for the sequence
"""
if input_ids is not None:
output[ (input_ids == 0) | (input_ids == 1) | (input_ids == 2) | (input_ids == 3) | (input_ids == 4) ] = np.nan
pair_indexes, impair_indexes = np.arange(0, output.shape[-1], 2), np.arange(
1, output.shape[-1], 2
)
sin, cos = output[:, :, impair_indexes], output[:, :, pair_indexes]
tan = np.arctan2(sin, cos)
angles = np.degrees(tan)
return angles
output = output.cpu().detach().numpy()
input_ids = inputs["input_ids"].cpu().detach().numpy()
real_angles = convert_sin_cos_to_angles(output, input_ids)
```
|