metadata

license: mit
library_name: transformers
datasets:
  - CCDS
  - Ensembl
pipeline_tag: feature-extraction
tags:
  - protein language model
  - biology
widget:
  - text: ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C L Q ]
    example_title: Example CCDS embedding extraction

cdsBERT

Model description

How to use

# Imports
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(sequence)) # need spaces in-between codons

example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
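with torch.no_grad():
    # BertForMaskedLM returns logits by default, so request the hidden states explicitly
    outputs = model(**example, output_hidden_states=True)
    matrix_embedding = outputs.hidden_states[-1].squeeze(0).cpu() # (sequence_length, hidden_size)

vector_embedding = matrix_embedding.mean(dim=0) # average over tokens -> (hidden_size,)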
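
When embedding several sequences at once, the tokenizer pads them to a common length, and a plain mean over the token axis would average in the padding positions. The following is a minimal sketch of mask-aware mean pooling, not part of the original card. The helper name `masked_mean_pool` is hypothetical, and the toy tensors stand in for real model outputs.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions.

    hidden_states:  (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    Note: masked_mean_pool is an illustrative helper, not a transformers API.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)                      # guard against empty rows
    return summed / counts                                       # (batch, hidden_size)


# Toy stand-in for model outputs: 2 sequences, 3 positions, hidden size 2.
# The last position of the first sequence is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
```

With the model above, this would presumably be applied by tokenizing a list of space-separated sequences with `padding=True`, calling the model with `output_hidden_states=True`, and passing `hidden_states[-1]` together with the batch's `attention_mask` to the helper. For a single unpadded sequence it gives the same result as the plain mean.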

Intended use and limitations

Our lab

The Gleghorn lab is an interdisciplinary research group at the University of Delaware that focuses on translational problems in biomedicine. We have recently begun exploring protein language models, with a particular interest in protein design and annotation.

Please cite

Coming soon!