Is SWA used during pretraining?
#113 · opened by EarthWorm001
I have two questions about sliding window attention (SWA):
1: Is SWA used during pretraining all the time? I mean, in every pretraining step.
2: If not, is SWA used only during a certain stage of pretraining? For example, pretrain the model with full attention for some time, then switch to SWA for the rest of pretraining.
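For concreteness, here is a minimal sketch of the difference between the two attention patterns in question, in terms of the masks they induce. The function name and NumPy formulation are illustrative only, not taken from any particular model's codebase:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True iff query i may attend to key j:
    causal (j <= i) and within the last `window` positions (i - j < window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Full causal attention is the special case window >= seq_len,
# so "switching" between the two is just a change of mask.
full_mask = sliding_window_mask(8, 8)   # every token sees all previous tokens
swa_mask = sliding_window_mask(8, 4)    # each token sees at most 4 tokens back
```

Under full causal attention the last query attends to every earlier position, while with `window=4` it attends only to the four most recent positions; a training schedule like the one asked about in question 2 would amount to swapping one mask for the other partway through pretraining.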
Thanks!