Embedding sequences
Hello,
Thanks for making this model available!
I have been trying to embed sequences of different lengths using the following code:
inputs = tokenizer(['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF'], return_tensors="pt", padding=True)
outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # last hidden state, i.e. the embedding before the final fc layer
The two example sequences have different lengths and therefore give different numbers of tokens, hence padding is needed (padding=True).
However, I get the following error:
ValueError: Asking to pad but the tokenizer does not have a padding token.
This makes me think that padding was not used at training time, as the tokenizer does not have a padding token.
How did you concatenate proteins of different lengths to create a batch at training time without padding?
Thanks for your help.
Hi flpgrz,
I did not pad in ProtGPT2 because the training sequences were concatenated and truncated into fixed-length blocks (grouped), so no padding token was needed. This is something I did not like and changed in ZymCTRL, which does have a padding token.
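In case it helps to picture it: the usual causal-LM recipe (the group_texts step in the Hugging Face run_clm examples) concatenates all tokenized sequences and slices the result into fixed-length blocks, so every training example has the same length and no padding is ever needed. A minimal sketch of that idea (the block size and token ids are made up, and this may not match the exact training script):

def group_texts(tokenized_sequences, block_size):
    # concatenate all token ids, then cut into equal-sized blocks,
    # dropping the remainder that does not fill a full block
    concatenated = [tok for seq in tokenized_sequences for tok in seq]
    total_length = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total_length, block_size)]

blocks = group_texts([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=4)
# -> [[5, 6, 7, 8], [9, 10, 11, 12]]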
In any case, I think you can add a padding token on the fly; could you try this?
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
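If the tokenizer change alone is not enough, I think you also have to resize the model's embedding matrix so the new pad id has an entry, and then use the attention mask so the pad positions do not leak into the embedding. A rough, untested sketch (the 'nferruz/ProtGPT2' checkpoint name and the mean pooling at the end are just illustrative choices):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('nferruz/ProtGPT2')
model = AutoModelForCausalLM.from_pretrained('nferruz/ProtGPT2')

# add a pad token if missing and tell the model about the enlarged vocabulary
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

inputs = tokenizer(['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF'], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]          # (batch, seq_len, hidden_dim)
mask = inputs['attention_mask'].unsqueeze(-1)    # 1 for real tokens, 0 for padding
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean over real tokens only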
This issue could also be useful: https://github.com/huggingface/transformers/issues/3021
I understand. Thanks for clarifying.
I might be wrong, but I think adding the padding token at the tokenizer step alone might not work, because the model does not know how to process the new token. But I should try it first.
What I have done so far is to embed one sequence at a time and zero-pad the per-token embeddings afterwards to account for the different lengths, roughly as in the sketch below.
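Concretely, my current workaround looks roughly like this (a simplified, untested sketch; I stack the per-token embeddings and zero-pad them to the longest sequence):

import torch

sequences = ['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF']
per_sequence = []
for seq in sequences:
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    per_sequence.append(outputs.hidden_states[-1].squeeze(0))   # (seq_len, hidden_dim)

# zero-pad along the sequence dimension up to the longest sequence in the batch
padded = torch.nn.utils.rnn.pad_sequence(per_sequence, batch_first=True)   # (batch, max_len, hidden_dim)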
Yes, I think you are right. I believe this is discussed in the GitHub issue I sent: https://github.com/huggingface/transformers/issues/3021
But I haven't tested it myself. Let me know if it works!