license: other
pipeline_tag: token-classification
tags:
  - biology
  - RNA
  - Torsional
  - Angles

RNA-TorsionBERT

Model Description

RNA-TorsionBERT is a 331 MB BERT-based language model that predicts RNA torsional and pseudo-torsional angles from the sequence.

RNA-TorsionBERT is a DNABERT model that was pre-trained on ~4200 RNA structures before being fine-tuned on 185 non-redundant structures.

It improves the mean absolute error (MAE), averaged over the nine angles below, by 6.2° over the previous state-of-the-art model, SPOT-RNA-1D, on the Test Set (composed of RNA-Puzzles and CASP-RNA).

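MAE per angle on the Test Set, in degrees (lower is better):

| Model | alpha | beta | gamma | delta | epsilon | zeta | chi | eta | theta |
|---|---|---|---|---|---|---|---|---|---|
| RNA-TorsionBERT | 37.3 | 19.6 | 29.4 | 13.6 | 16.6 | 26.6 | 14.7 | 20.1 | 25.4 |
| SPOT-RNA-1D | 45.7 | 23 | 33.6 | 19 | 21.1 | 34.4 | 19.3 | 28.9 | 33.9 |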
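
The 6.2° figure appears consistent with averaging the per-angle MAE values over the nine angles for each model and taking the difference. The following sketch redoes that arithmetic from the numbers in the table; the variable names are illustrative, and the averaging scheme is an assumption inferred from the table rather than something stated in the model card.

```python
# Per-angle MAE values (degrees), copied from the table above.
rna_torsionbert = {
    "alpha": 37.3, "beta": 19.6, "gamma": 29.4, "delta": 13.6, "epsilon": 16.6,
    "zeta": 26.6, "chi": 14.7, "eta": 20.1, "theta": 25.4,
}
spot_rna_1d = {
    "alpha": 45.7, "beta": 23.0, "gamma": 33.6, "delta": 19.0, "epsilon": 21.1,
    "zeta": 34.4, "chi": 19.3, "eta": 28.9, "theta": 33.9,
}

# Assumption: the headline number is an unweighted mean over the nine angles.
mean_torsionbert = sum(rna_torsionbert.values()) / len(rna_torsionbert)
mean_spot = sum(spot_rna_1d.values()) / len(spot_rna_1d)
improvement = mean_spot - mean_torsionbert

print(round(mean_torsionbert, 1), round(mean_spot, 1), round(improvement, 1))
# 22.6 28.8 6.2
```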

Key Features

  • Prediction of torsional and pseudo-torsional angles
  • Supports sequences of up to 512 nucleotides

Usage

Get started predicting angles with RNA-TorsionBERT by using the following code snippet:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)

# The input is written as space-separated overlapping 3-mers (here, the sequence ACGGTT).
sequence = "ACG CGG GGT GTT"
params_tokenizer = {
    "return_tensors": "pt",
    "padding": "max_length",
    "max_length": 512,
    "truncation": True,
}
inputs = tokenizer(sequence, **params_tokenizer)
output = model(inputs)["logits"]

  • Note that the model was fine-tuned from a DNABERT-3 model, so its tokenizer is the same as the one used for DNABERT. Replace every U with T in the input sequence.
  • The output contains the sine and the cosine of each angle. The angles are in the following order: alpha, beta, gamma, delta, epsilon, zeta, chi, eta, theta.
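
The example input above ("ACG CGG GGT GTT") looks like the overlapping 3-mers of ACGGTT, which matches the DNABERT-3 convention. A minimal sketch of that preparation step follows. It assumes the tokenizer expects space-separated overlapping 3-mers with T in place of U; `rna_to_kmers` is a hypothetical helper, not part of the model's code.

```python
def rna_to_kmers(sequence: str, k: int = 3) -> str:
    """Hypothetical helper: format an RNA sequence as space-separated overlapping k-mers."""
    # The tokenizer comes from DNABERT, so U is written as T.
    dna_like = sequence.upper().replace("U", "T")
    if len(dna_like) < k:
        raise ValueError(f"sequence must contain at least {k} nucleotides")
    # Slide a window of size k over the sequence, one nucleotide at a time.
    kmers = [dna_like[i:i + k] for i in range(len(dna_like) - k + 1)]
    return " ".join(kmers)


example = rna_to_kmers("ACGGUU")
print(example)  # ACG CGG GGT GTT
```

The resulting string can be passed to the tokenizer in place of the hard-coded `sequence`.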

To convert the predictions into angles, you can use the following code snippet:

from typing import Optional

import numpy as np

ANGLES_ORDER = [ "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta" ]

def convert_sin_cos_to_angles(output: np.ndarray, input_ids: Optional[np.ndarray] = None) -> np.ndarray:
    """
    Convert the raw predictions of RNA-TorsionBERT into angles in degrees.
    Each angle is recovered from its predicted sine and cosine using:
        alpha = arctan2(sin(alpha), cos(alpha))
    :param output: np.ndarray with the raw predictions of RNA-TorsionBERT
        (sine and cosine values along the last axis)
    :param input_ids: the input_ids given to RNA-TorsionBERT. They are used to keep only the tokens
        of the sequence, and not the special tokens.
    :return: a np.ndarray with the angles for the sequence, in the order of ANGLES_ORDER
    """
    if input_ids is not None:
        # Token ids 0 to 4 are special tokens: set their predictions to NaN (modifies `output` in place).
        output[np.isin(input_ids, [0, 1, 2, 3, 4])] = np.nan
    # Even indexes of the last axis are read as cosines, odd indexes as sines.
    even_indexes = np.arange(0, output.shape[-1], 2)
    odd_indexes = np.arange(1, output.shape[-1], 2)
    sin, cos = output[:, :, odd_indexes], output[:, :, even_indexes]
    # arctan2 keeps the quadrant, so the angles span the full [-180, 180] range.
    angles = np.degrees(np.arctan2(sin, cos))
    return angles

output = output.cpu().detach().numpy()
input_ids = inputs["input_ids"].cpu().detach().numpy()
real_angles = convert_sin_cos_to_angles(output, input_ids)
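
As a sanity check of the conversion, the sketch below encodes known angles as sine and cosine values, recovers them with `arctan2`, and labels them using the angle order. It runs on synthetic data only and assumes the layout read by the snippet above (cosine at even indexes, sine at odd indexes of the last axis); `encode_angles` and `decode_angles` are hypothetical helpers, not part of the model's code.

```python
import numpy as np

ANGLES_ORDER = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta"]


def encode_angles(angles_deg: np.ndarray) -> np.ndarray:
    """Hypothetical helper: interleave cos (even indexes) and sin (odd indexes)."""
    radians = np.radians(angles_deg)
    encoded = np.empty(angles_deg.shape[:-1] + (2 * angles_deg.shape[-1],))
    encoded[..., 0::2] = np.cos(radians)
    encoded[..., 1::2] = np.sin(radians)
    return encoded


def decode_angles(encoded: np.ndarray) -> np.ndarray:
    """Recover angles in degrees, within [-180, 180], from interleaved cos/sin values."""
    return np.degrees(np.arctan2(encoded[..., 1::2], encoded[..., 0::2]))


# One sequence of two nucleotides, nine angles each (arbitrary values for illustration).
known = np.array([[
    [-60.0, 170.0, 55.0, 80.0, -150.0, -70.0, -160.0, 165.0, -140.0],
    [-65.0, 175.0, 50.0, 85.0, -155.0, -75.0, -165.0, 170.0, -145.0],
]])
recovered = decode_angles(encode_angles(known))

# arctan2 keeps the quadrant, so angles beyond +/-90 degrees survive the round trip;
# a plain arctan(sin / cos) would fold them into (-90, 90).
per_angle = {name: recovered[0, :, i] for i, name in enumerate(ANGLES_ORDER)}
print(np.round(per_angle["beta"], 1))
```

The same dictionary comprehension can be applied to `real_angles` to look up predictions by angle name.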