lhallee committed on
Commit
86782c8
1 Parent(s): 0e1cee2

Update README.md

  1. README.md +19 -53
README.md CHANGED
@@ -2,45 +2,23 @@
2
  license: mit
3
  library_name: transformers
4
  datasets:
5
- - BIOGRID
6
- - Negatome
7
- pipeline_tag: text-classification
8
  tags:
9
  - protein language model
10
  - biology
11
  widget:
12
  - text: >-
13
- M S H S V K I Y D T C I G C T Q C V R A C P T D V L E M I P W G G C K A K Q
14
- I A S A P R T E D C V G C K R C E S A C P T D F L S V R V Y L W H E T T R S
15
- M G L A Y [SEP] M I N L P S L F V P L V G L L F P A V A M A S L F L H V E K
16
- R L L F S T K K I N
17
- example_title: Non-interacting proteins
18
- - text: >-
19
- M S I N I C R D N H D P F Y R Y K M P P I Q A K V E G R G N G I K T A V L N
20
- V A D I S H A L N R P A P Y I V K Y F G F E L G A Q T S I S V D K D R Y L V
21
- N G V H E P A K L Q D V L D G F I N K F V L C G S C K N P E T E I I I T K D
22
- N D L V R D C K A C G K R T P M D L R H K L S S F I L K N P P D S V S G S K
23
- K K K K A A T A S A N V R G G G L S I S D I A Q G K S Q N A P S D G T G S S
24
- T P Q H H D E D E D E L S R Q I K A A A S T L E D I E V K D D E W A V D M S
25
- E E A I R A R A K E L E V N S E L T Q L D E Y G E W I L E Q A G E D K E N L
26
- P S D V E L Y K K A A E L D V L N D P K I G C V L A Q C L F D E D I V N E I
27
- A E H N A F F T K I L V T P E Y E K N F M G G I E R F L G L E H K D L I P L
28
- L P K I L V Q L Y N N D I I S E E E I M R F G T K S S K K F V P K E V S K K
29
- V R R A A K P F I T W L E T A E S D D D E E D D E [SEP] M S I E N L K S F D
30
- P F A D T G D D E T A T S N Y I H I R I Q Q R N G R K T L T T V Q G V P E E
31
- Y D L K R I L K V L K K D F A C N G N I V K D P E M G E I I Q L Q G D Q R A
32
- K V C E F M I S Q L G L Q K K N I K I H G F
33
- example_title: Interacting proteins
34
  ---
35
- <img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/Ro4uhQDurP-x7IHJj11xa.png" width="350">
36
-
37
- ## Model description
38
 
39
- SYNTERACT (SYNThetic data-driven protein-protein intERACtion Transformer) is a fine-tuned version of [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) that attends to two amino acid sequences separated by [SEP] to determine whether they plausibly interact in a biological context.
 
40
 
41
- We utilized the multivalidated physical interaction dataset from BIOGRID, Negatome, and synthetic negative samples to train our model. Check out our [preprint](https://www.biorxiv.org/content/10.1101/2023.06.07.544109v1.full) for more details.
42
 
43
- SYNTERACT achieved unprecedented performance over vast phylogeny with 92-96% accuracy on real unseen examples, and is already being used to accelerate drug target screening and peptide therapeutic design.
44
 
45
 
46
  ## How to use
@@ -50,41 +28,29 @@ SYNTERACT achieved unprecedented performance over vast phylogeny with 92-96% acc
50
  import re
51
  import torch
52
  import torch.nn.functional as F
53
- from transformers import BertForSequenceClassification, BertTokenizer
54
 
55
- model = BertForSequenceClassification.from_pretrained('lhallee/SYNTERACT') # load model
56
- tokenizer = BertTokenizer.from_pretrained('lhallee/SYNTERACT') # load tokenizer
57
  device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
58
  model.to(device) # move to device
59
  model.eval() # put in eval mode
60
 
61
- sequence_a = 'MEKSCSIGNGREQYGWGHGEQCGTQFLECVYRNASMYSVLGDLITYVVFLGATCYAILFGFRLLLSCVRIVLKVVIALFVIRLLLALGSVDITSVSYSG' # Uniprot A1Z8T3
62
- sequence_b = 'MRLTLLALIGVLCLACAYALDDSENNDQVVGLLDVADQGANHANDGAREARQLGGWGGGWGGRGGWGGRGGWGGRGGWGGRGGWGGGWGGRGGWGGRGGGWYGR' # Uniprot A1Z8H0
63
- sequence_a = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence_a))) # need spaces inbetween amino acids
64
- sequence_b = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence_b))) # replace rare amino acids with X
65
- example = sequence_a + ' [SEP] ' + sequence_b # add SEP token
66
 
67
- example = tokenizer(example, return_tensors='pt', padding=False).to(device) # tokenize example
68
  with torch.no_grad():
69
-     logits = model(**example).logits.cpu().detach() # get logits from model
70
 
71
- probability = F.softmax(logits, dim=-1) # use softmax to get "confidence" in the prediction
72
- prediction = probability.argmax(dim=-1) # 0 for no interaction, 1 for interaction
73
  ```
74
 
75
  ## Intended use and limitations
76
- We define a protein-protein interaction as physical contact that mediates chemical or conformational change, especially with non-generic function. However, due to SYNTERACT's propensity to predict false positives, we believe that it identifies plausible conformational changes caused by interactions without relevance to function. Therefore, predictions by SYNTERACT should always be taken with a grain of salt and used as a means of hypothesis generation or secondary validation.
77
 
78
  ## Our lab
79
- The [Gleghorn lab](https://www.gleghornlab.com/) is an interdiciplinary research group out of the University of Delaware that focuses on translational problems around biomedicine. Recently we have begun exploration into protein language models and are passionate about excellent protein design and annotation.
80
 
81
  ## Please cite
82
- @article {Hallee2023.06.07.544109,
83
- author = {Logan Hallee and Jason P. Gleghorn},
84
- title = {Protein-Protein Interaction Prediction is Achievable with Large Language Models},
85
- elocation-id = {2023.06.07.544109},
86
- year = {2023},
87
- doi = {10.1101/2023.06.07.544109},
88
- publisher = {Cold Spring Harbor Laboratory},
89
- journal = {bioRxiv}
90
- }
 
2
  license: mit
3
  library_name: transformers
4
  datasets:
5
+ - CCDS
6
+ - Ensembl
7
+ pipeline_tag: feature-extraction
8
  tags:
9
  - protein language model
10
  - biology
11
  widget:
12
  - text: >-
13
+ ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C \# L Q D T N N F F G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p ^ v ]
14
+ example_title: Example CCDS embedding extraction
15
  ---
 
 
 
16
 
17
+ # cdsBERT
18
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">
19
 
20
+ ## Model description
21
 
 
22
 
23
 
24
  ## How to use
 
28
  import re
29
  import torch
30
  import torch.nn.functional as F
31
+ from transformers import BertForMaskedLM, BertTokenizer
32
 
33
+ model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
34
+ tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
35
  device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
36
  model.to(device) # move to device
37
  model.eval() # put in eval mode
38
 
39
+ sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
40
+ sequence = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence))) # need spaces in-between amino acids, replace rare amino acids with X
 
 
 
41
 
42
+ example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
43
  with torch.no_grad():
44
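+     matrix_embedding = model(**example, output_hidden_states=True).hidden_states[-1].cpu() # last-layer token embeddings, shape (1, seq_len, hidden_dim); BertForMaskedLM output has no last_hidden_state attribute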
45
 
46
+ vector_embedding = matrix_embedding.mean(dim=1) # average over the sequence dimension, shape (1, hidden_dim)
 
47
  ```
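
The last step of the snippet above reduces per-token embeddings to one vector per sequence. Below is a minimal sketch of that reduction, assuming the hidden states have shape `(batch, seq_len, hidden_dim)`, which is how Hugging Face BERT models return them. `mean_pool` is a hypothetical helper name, not part of the cdsBERT repository, and the dummy tensors stand in for real model output so the example runs without downloading weights.

```python
import torch


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence axis, skipping padded positions.

    hidden:         (batch, seq_len, hidden_dim) token embeddings
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)               # guard against division by zero
    return summed / counts


# Dummy data (assumption): one sequence of three positions, the last one padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, attention_mask)  # one vector per sequence
```

With a single unpadded sequence, as in the README snippet, this is equivalent to averaging over `dim=1`. The mask only matters once several sequences are padded to a common length. Note that special tokens such as `[CLS]` and `[SEP]` carry a mask value of 1, so they are included in the average unless they are masked out separately.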
48
 
49
  ## Intended use and limitations
50
+
51
 
52
  ## Our lab
53
+ The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group out of the University of Delaware that focuses on translational problems around biomedicine. Recently we have begun exploration into protein language models and are passionate about excellent protein design and annotation.
54
 
55
  ## Please cite
56
+ Coming soon!