---
license: mit
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: feature-extraction
tags:
- protein language model
- biology
widget:
- text: >-
    ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C # L Q D T N N F F
    G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D
    K @ p p ^ v ]
  example_title: Example CCDS embedding extraction
---

# cdsBERT

## Model description

cdsBERT is a BERT-style protein language model trained on coding sequences from the CCDS and Ensembl databases. It is intended for feature extraction: producing per-token (matrix) and per-sequence (vector) embeddings of coding sequences, as shown below.

## How to use

```python
# Imports
import re
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence))) # need spaces in-between tokens, replace rare amino acids with X
example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example

with torch.no_grad():
    # BertForMaskedLM returns logits by default; request hidden states to recover embeddings
    output = model(**example, output_hidden_states=True)
    matrix_embedding = output.hidden_states[-1].cpu() # (1, sequence_length, hidden_size)
    vector_embedding = matrix_embedding.mean(dim=1) # mean-pool over the sequence dimension -> (1, hidden_size)
```

A sketch for embedding several sequences at once appears at the end of this card.

## Intended use and limitations

cdsBERT is intended for extracting matrix (per-token) and mean-pooled vector embeddings of coding sequences for downstream analysis.

## Our lab

The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware focused on translational problems in biomedicine. We have recently begun exploring protein language models and are passionate about excellent protein design and annotation.

## Please cite

Coming soon!
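
## Embedding multiple sequences

As a follow-up to the example above, here is a minimal sketch for embedding a batch of sequences in one forward pass. The `embed_batch` helper and the attention-mask-aware mean pooling are illustrative choices, not part of the released model; the preprocessing mirrors the single-sequence example.

```python
import re
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT')
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.eval()

def embed_batch(sequences):
    # Space-separate characters and replace rare amino acids, as in the example above
    sequences = [' '.join(list(re.sub(r'[UZOB]', 'X', s))) for s in sequences]
    batch = tokenizer(sequences, return_tensors='pt', padding=True).to(device)
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True).hidden_states[-1] # (batch, seq_len, hidden_size)
    # Mean-pool over real tokens only, so padding does not dilute the vectors
    mask = batch['attention_mask'].unsqueeze(-1).float() # (batch, seq_len, 1)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).cpu() # (batch, hidden_size)
```

Masking the padding positions before pooling keeps sequences of different lengths comparable within one batch.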