Embedding sequences
Hello,
Thanks for making this model available!
I have been trying to embed sequences of different lengths using the following code:
inputs = tokenizer(['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF'], return_tensors="pt", padding=True)
outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # last hidden state, i.e. the embedding before the final fc layer
The two example sequences have different lengths and therefore give different numbers of tokens, hence padding is needed (padding=True).
However, I get the following error:
ValueError: Asking to pad but the tokenizer does not have a padding token.
This makes me think that padding was not used at training time, as the tokenizer does not have a padding token.
How did you concatenate proteins of different lengths to create a batch at training time without padding?
Thanks for your help.
Hi flpgrz,
I did not pad in ProtGPT2 because the training sequences were concatenated and truncated into fixed-length blocks (grouped), so no padding token was needed. This is something I did not like and changed in ZymCTRL, which does have a padding token.
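In case it helps to picture it: the usual causal-LM recipe (the group_texts step in the Hugging Face run_clm examples) concatenates all tokenized sequences and slices the result into fixed-length blocks, so every training example has the same length and no padding is ever needed. A minimal sketch of that idea (the block size and token ids are made up, and this may not match the exact training script):

def group_texts(tokenized_sequences, block_size):
    # concatenate all token ids, then cut into equal-sized blocks,
    # dropping the remainder that does not fill a full block
    concatenated = [tok for seq in tokenized_sequences for tok in seq]
    total_length = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total_length, block_size)]

blocks = group_texts([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=4)
# -> [[5, 6, 7, 8], [9, 10, 11, 12]]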
In any case, I think you can add a padding token on the fly; could you try this?
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
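If the tokenizer change alone is not enough, I think you also have to resize the model's embedding matrix so the new pad id has an entry, and then use the attention mask so the pad positions do not leak into the embedding. A rough, untested sketch (the 'nferruz/ProtGPT2' checkpoint name and the mean pooling at the end are just illustrative choices):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('nferruz/ProtGPT2')
model = AutoModelForCausalLM.from_pretrained('nferruz/ProtGPT2')

# add a pad token if missing and tell the model about the enlarged vocabulary
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

inputs = tokenizer(['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF'], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]          # (batch, seq_len, hidden_dim)
mask = inputs['attention_mask'].unsqueeze(-1)    # 1 for real tokens, 0 for padding
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean over real tokens only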
This issue could also be useful: https://github.com/huggingface/transformers/issues/3021
I understand. Thanks for clarifying.
I might be wrong, but I think adding the padding token at the tokenizer step alone might not work, because the model does not know how to process the new token. But I should try it first.
What I have done so far is to embed one sequence at a time and zero-pad the per-token embeddings afterwards to account for the different lengths, roughly as in the sketch below.
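Concretely, my current workaround looks roughly like this (a simplified, untested sketch; I stack the per-token embeddings and zero-pad them to the longest sequence):

import torch

sequences = ['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF']
per_sequence = []
for seq in sequences:
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    per_sequence.append(outputs.hidden_states[-1].squeeze(0))   # (seq_len, hidden_dim)

# zero-pad along the sequence dimension up to the longest sequence in the batch
padded = torch.nn.utils.rnn.pad_sequence(per_sequence, batch_first=True)   # (batch, max_len, hidden_dim)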
Yes, I think you are right. I believe this is discussed in the GitHub issue I sent: https://github.com/huggingface/transformers/issues/3021
But I haven't tested it myself. Let me know if it works!