metadata
license: mit
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: feature-extraction
tags:
- protein language model
- biology
widget:
- text: ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C L Q ]
  example_title: Example CCDS embedding extraction
cdsBERT
Model description
How to use
# Imports
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(sequence)) # need spaces in-between codons
example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    # BertForMaskedLM returns logits by default, so request the hidden states explicitly
    outputs = model(**example, output_hidden_states=True)
matrix_embedding = outputs.hidden_states[-1].squeeze(0).cpu() # last layer, shape (seq_len, hidden_dim)
vector_embedding = matrix_embedding.mean(dim=0) # average over tokens, shape (hidden_dim,)
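The example above embeds a single sequence, so a plain mean over the token axis is enough. If several sequences are tokenized together with padding, the padded positions should probably be left out of the average. The sketch below shows one common way to do that with the attention mask. The helper name `masked_mean_pool` and the toy tensors are illustrative assumptions, not part of the cdsBERT release.

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings per sequence, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden_dim), e.g. outputs.hidden_states[-1]
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts                                       # (batch, hidden_dim)

# Toy tensors standing in for model outputs (hypothetical values, no model download needed)
hidden = torch.ones(2, 4, 3)
hidden[1, 2:] = 100.0  # values sitting at padded positions, which must not affect the mean
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)
```

With a real batch, `hidden_states` would be the last-layer output of the model and `attention_mask` the tensor of the same name returned by the tokenizer when it is called with `padding=True`. The mask still counts any special tokens the tokenizer adds, so whether to exclude those as well is a separate choice.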
Intended use and limitations
Our lab
The Gleghorn lab is an interdisciplinary research group at the University of Delaware that focuses on translational problems in biomedicine. We have recently begun exploring protein language models, with a particular interest in protein design and annotation.
Please cite
Coming soon!