|
--- |
|
tags: |
|
- antibody language model |
|
- antibody |
|
base_model: Exscientia/IgT5_unpaired |
|
license: mit |
|
--- |
|
|
|
# IgT5 model |
|
|
|
Pretrained model on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper [Large scale paired antibody language models](https://arxiv.org/abs/2403.17889). |
|
|
|
The model is finetuned from IgT5-unpaired on paired antibody sequences from the paired Observed Antibody Space (OAS) database.
|
|
|
# Use |
|
|
|
The encoder part of the model and the tokeniser can be loaded using the `transformers` library
|
|
|
```python |
|
from transformers import T5EncoderModel, T5Tokenizer |
|
|
|
tokeniser = T5Tokenizer.from_pretrained("Exscientia/IgT5", do_lower_case=False) |
|
model = T5EncoderModel.from_pretrained("Exscientia/IgT5") |
|
``` |
|
|
|
The tokeniser is used to prepare batch inputs |
|
```python |
|
# heavy chain sequences |
|
sequences_heavy = [ |
|
"VQLAQSGSELRKPGASVKVSCDTSGHSFTSNAIHWVRQAPGQGLEWMGWINTDTGTPTYAQGFTGRFVFSLDTSARTAYLQISSLKADDTAVFYCARERDYSDYFFDYWGQGTLVTVSS", |
|
"QVQLVESGGGVVQPGRSLRLSCAASGFTFSNYAMYWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRTEDTAVYYCASGSDYGDYLLVYWGQGTLVTVSS" |
|
] |
|
|
|
# light chain sequences |
|
sequences_light = [ |
|
"EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK", |
|
"ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL" |
|
] |
|
|
|
# The tokeniser expects input of the form ["V Q ... S S </s> E V ... I K", ...] |
|
paired_sequences = [] |
|
for sequence_heavy, sequence_light in zip(sequences_heavy, sequences_light):
    paired_sequences.append(' '.join(sequence_heavy) + ' </s> ' + ' '.join(sequence_light))

tokens = tokeniser.batch_encode_plus(
    paired_sequences,
    add_special_tokens=True,
    padding=True,
    return_tensors="pt",
    return_special_tokens_mask=True
)
|
``` |
|
|
|
Note that the tokeniser adds a `</s>` token at the end of each paired sequence and pads using the `<pad>` token. For example, a batch containing the sequences `V Q L </s> E V V` and `Q V </s> A L` will be tokenised to `V Q L </s> E V V </s>` and `Q V </s> A L </s> <pad> <pad>`.
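

The effect of the appended `</s>` tokens and the padding can be checked by converting the encoded ids back into tokens, for example

```python
# inspect the tokens of the first paired sequence in the batch
print(tokeniser.convert_ids_to_tokens(tokens["input_ids"][0].tolist()))
```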
|
|
|
|
|
Sequence embeddings are generated by feeding tokens through the model |
|
|
|
```python |
|
output = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']
)
|
|
|
residue_embeddings = output.last_hidden_state |
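# residue_embeddings has shape (batch_size, sequence_length, hidden_size)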
|
``` |
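

Gradients are not needed when the model is only used to extract embeddings, so the forward pass can equally be wrapped in `torch.no_grad()` to save memory

```python
import torch

# equivalent forward pass without tracking gradients
with torch.no_grad():
    output = model(
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
    residue_embeddings = output.last_hidden_state
```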
|
|
|
To obtain a sequence representation, the residue embeddings can be averaged over the sequence (excluding special tokens) like so
|
|
|
```python |
|
import torch |
|
|
|
# mask special tokens before summing over embeddings |
|
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0 |
|
sequence_embeddings_sum = residue_embeddings.sum(1) |
|
|
|
# average embedding by dividing sum by sequence lengths |
|
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1) |
|
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1) |
|
``` |
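

As a usage sketch, the pooled embeddings can then be compared directly, for example by computing the cosine similarity between the two paired sequences in the batch

```python
import torch.nn.functional as F

# cosine similarity between the embeddings of the two paired sequences
similarity = F.cosine_similarity(sequence_embeddings[0], sequence_embeddings[1], dim=0)
print(similarity.item())
```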
|
|