---
license: mit
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: feature-extraction
tags:
- protein language model
- biology
widget:
- text: >-
    ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C # L Q D T N N F F
    G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D
    K @ p p ^ v ]
  example_title: Example CCDS embedding extraction
---

# cdsBERT

## Model description

cdsBERT is a BERT-style protein language model trained on coding sequences from the CCDS and Ensembl databases. It is intended for feature extraction: producing per-token (matrix) and per-sequence (vector) embeddings of coding sequences, as shown below.

## How to use

```python
# Imports
import re
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence))) # need spaces in-between tokens, replace rare amino acids with X
example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example

with torch.no_grad():
    # BertForMaskedLM returns logits by default; request hidden states to recover embeddings
    output = model(**example, output_hidden_states=True)
    matrix_embedding = output.hidden_states[-1].cpu() # (1, sequence_length, hidden_size)
    vector_embedding = matrix_embedding.mean(dim=1) # mean-pool over the sequence dimension -> (1, hidden_size)
```

A sketch for embedding several sequences at once appears at the end of this card.

## Intended use and limitations

cdsBERT is intended for extracting matrix (per-token) and mean-pooled vector embeddings of coding sequences for downstream analysis.

## Our lab

The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware focused on translational problems in biomedicine. We have recently begun exploring protein language models and are passionate about excellent protein design and annotation.

## Please cite

Coming soon!
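
## Embedding multiple sequences

As a follow-up to the example above, here is a minimal sketch for embedding a batch of sequences in one forward pass. The `embed_batch` helper and the attention-mask-aware mean pooling are illustrative choices, not part of the released model; the preprocessing mirrors the single-sequence example.

```python
import re
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT')
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.eval()

def embed_batch(sequences):
    # Space-separate characters and replace rare amino acids, as in the example above
    sequences = [' '.join(list(re.sub(r'[UZOB]', 'X', s))) for s in sequences]
    batch = tokenizer(sequences, return_tensors='pt', padding=True).to(device)
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True).hidden_states[-1] # (batch, seq_len, hidden_size)
    # Mean-pool over real tokens only, so padding does not dilute the vectors
    mask = batch['attention_mask'].unsqueeze(-1).float() # (batch, seq_len, 1)
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).cpu() # (batch, hidden_size)
```

Masking the padding positions before pooling keeps sequences of different lengths comparable within one batch.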