nferruz/ProtGPT2 · There is an 'X' in generated sequences.

Apr 24, 2023

edited Apr 24, 2023

Dear Noelia,

When I analyze sequences generated with ProtGPT2, some of them contain 'X' as an amino acid (either at the beginning or in the middle). Could you please let me know how I should interpret this character?

Thanks!

SweetGAN changed discussion status to closed Apr 24, 2023

SweetGAN changed discussion status to open Apr 24, 2023

nferruz

Owner Apr 25, 2023

Hi SweetGAN,

The UniRef database contains sequences with 'X' as an amino acid (actually, it appears pretty frequently somehow!). Hence the model has learned that this token sometimes appears in the training set, and in which contexts, so it generates sequences that resemble that distribution.
What I recommend is to always compute the perplexity and keep only the best 5-10% (lowest perplexity) of each generation batch, or be even more restrictive if you can. This way, you keep the best sequences the model produced. If I am not mistaken, those should have a lower proportion of the 'X' amino acid.
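A minimal sketch of that filtering step, in case it helps. It assumes the model is loaded through the Hugging Face `transformers` causal-LM classes and that perplexity is taken as the exponential of the mean token loss. The helper names (`make_protgpt2_scorer`, `select_best`, `x_fraction`) and the default `keep_fraction` are made up for illustration, not part of any library. Perplexity values also depend on how the input string is formatted, so check the model card for the expected format before comparing numbers.

```python
import math


def make_protgpt2_scorer(model_name="nferruz/ProtGPT2", device="cpu"):
    """Return a function that maps a sequence string to its perplexity.

    torch and transformers are imported here, not at module level, so the
    selection helpers below can be used without them installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.eval()

    def perplexity(sequence):
        input_ids = tokenizer(sequence, return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            # With labels equal to the inputs, .loss is the mean
            # next-token cross-entropy over the sequence.
            loss = model(input_ids, labels=input_ids).loss
        return math.exp(loss.item())

    return perplexity


def select_best(sequences, score, keep_fraction=0.1):
    """Keep the lowest-scoring fraction of a batch as (score, sequence) pairs.

    `score` is any callable returning a number where lower is better,
    for example the perplexity function built above.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    scored = sorted(((score(s), s) for s in sequences), key=lambda pair: pair[0])
    n_keep = max(1, math.ceil(len(scored) * keep_fraction))
    return scored[:n_keep]


def x_fraction(sequence):
    """Fraction of residues that are 'X', ignoring line breaks."""
    residues = sequence.replace("\n", "")
    return residues.count("X") / len(residues) if residues else 0.0
```

With a batch of generated sequences in a list called `generated` (a placeholder name), something like `select_best(generated, make_protgpt2_scorer(), keep_fraction=0.05)` would keep the top 5%, and `x_fraction` can then be used to check whether the survivors really do contain fewer 'X' residues.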