---
license: cc-by-nc-2.0
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: fill-mask
tags:
- protein language model
- biology
widget:
- text: ( Z [MASK] V L P Y G D E K L S P Y G D G G D V G Q I F s C B L Q D T N N F F G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p ^ v
  example_title: Fill codon mask (Y)
---

# cdsBERT
<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">

## Model description

cdsBERT is a protein language model (pLM) with a codon vocabulary. It was seeded with [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) and trained with a novel vocabulary extension pipeline called MELD. cdsBERT offers a biologically relevant latent space and outperforms ProtBERT on Enzyme Commission (EC) number prediction.

## How to use

```python
# Imports
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(sequence)) # need spaces in-between codons

example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
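with torch.no_grad():
    # BertForMaskedLM returns logits by default, so request the hidden states explicitly
    outputs = model(**example, output_hidden_states=True)
    matrix_embedding = outputs.hidden_states[-1].squeeze(0).cpu() # last layer, shape (seq_len, hidden_size)

vector_embedding = matrix_embedding.mean(dim=0) # mean-pool over tokens (special tokens included) -> (hidden_size,)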
```
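
Since the model is tagged for fill-mask, it can help to build masked inputs programmatically. The sketch below is only an illustration: `mask_codon` is a hypothetical helper, not part of this repository. It assumes each character of the raw string is one codon token (as in the example above) and that the mask token is the standard BERT `[MASK]`, as in the widget text of this card.

```python
def mask_codon(sequence: str, position: int, mask_token: str = "[MASK]") -> str:
    """Return a space-separated codon string with one position masked.

    Hypothetical helper: assumes one character per codon token.
    """
    tokens = list(sequence)  # split into single-character codon tokens
    if not 0 <= position < len(tokens):
        raise IndexError(f"position {position} is outside the sequence")
    tokens[position] = mask_token  # replace the chosen codon with the mask token
    return " ".join(tokens)  # the tokenizer expects spaces between codons


# Mask the third codon of a short prefix of the example sequence
masked = mask_codon("(ZEVL", 2)
print(masked)
```

The resulting string can be passed to the tokenizer in the same way as `sequence` above, or to a `transformers` fill-mask pipeline loaded with `lhallee/cdsBERT`.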

## Intended use and limitations
cdsBERT serves as a general-purpose protein language model with a codon vocabulary.

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently, we have begun exploring protein language models, and we strive to make protein design and annotation accessible.

## Please cite
Coming soon!