Is SWA used during pretraining?

#113
by EarthWorm001 - opened

I have two questions about sliding window attention (SWA):

1: Is SWA used during pretraining all the time? I mean, in every pretraining step.
2: If not, is SWA used only during a certain stage of pretraining? For example, pretraining the model with full attention for some time and then switching to SWA for the rest of pretraining.
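For context on what I mean by the two attention patterns, here is a minimal NumPy sketch (not the model authors' actual code) contrasting a full causal mask with a sliding-window causal mask; the `window` value and helper names are purely illustrative:

```python
# Minimal sketch: full causal mask vs. sliding-window causal mask.
# `window` is a hypothetical hyperparameter for illustration only.
import numpy as np

def full_causal_mask(seq_len: int) -> np.ndarray:
    """Each position may attend to itself and all earlier positions."""
    i = np.arange(seq_len)[:, None]  # query index
    j = np.arange(seq_len)[None, :]  # key index
    return j <= i

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Each position may attend only to the previous `window` positions
    (itself included), keeping per-layer cost O(seq_len * window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

if __name__ == "__main__":
    L, W = 8, 3
    print("full causal mask:\n", full_causal_mask(L).astype(int))
    print(f"sliding-window mask (window={W}):\n",
          sliding_window_causal_mask(L, W).astype(int))
```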

Thanks!

