Reset attention mask across doc boundary
#14
by jimmyhbx · opened
Hi,
Thanks for sharing this great model. I am wondering: if we want to continue pretraining Llama 3 and reset the attention mask to avoid cross-document attention, should we also reset the position IDs used by the rotary embedding?
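To make the question concrete, here is a rough sketch of what I mean by resetting the position IDs at document boundaries (the helper name and the use of packed document lengths are my own, not from any particular training code):

```python
import torch

def packed_position_ids(doc_lengths: list[int]) -> torch.Tensor:
    """Position IDs for a packed sequence, restarting at 0 for each document.

    packed_position_ids([3, 2]) -> tensor([0, 1, 2, 0, 1])
    versus the unreset variant torch.arange(5) -> tensor([0, 1, 2, 3, 4]).
    """
    return torch.cat([torch.arange(n) for n in doc_lengths])
```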
I think there is no need to reset the position IDs in the pre-training stage, since documents longer than 8k tokens are rare.
An implementation of the cross-document attention mask is available in FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/432#issuecomment-1698610752.
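For reference, the usual way to do this with the varlen kernel is to pass cumulative document lengths instead of an explicit mask. A minimal sketch, assuming FlashAttention 2's flash_attn_varlen_func and packed (total_tokens, n_heads, head_dim) tensors; the shapes and lengths here are made up, see the linked issue for the exact usage:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Three docs of lengths 3, 5, 4 packed into one 12-token sequence.
cu_seqlens = torch.tensor([0, 3, 8, 12], dtype=torch.int32, device="cuda")
max_seqlen = 5  # longest document in the pack

total_tokens, n_heads, head_dim = 12, 8, 64
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# causal=True gives causal attention within each document; cu_seqlens keeps
# tokens from attending across document boundaries.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)
```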
I just realized that resetting the position IDs after a doc boundary is identical to not resetting them, because the rotary embedding only depends on relative positions: once cross-document attention is masked out, every attending query/key pair lies in the same document, and their relative offset is the same whether or not the positions restart at the boundary.
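A quick toy check (my own simplified interleaved RoPE, not the actual Llama code): the attention score depends only on the offset between query and key positions, so shifting both by the same amount, which is all a reset does within a document, leaves the score unchanged.

```python
import torch

def rope(x: torch.Tensor, pos: int, theta: float = 10000.0) -> torch.Tensor:
    """Rotate a single head-dim vector x to position `pos` (interleaved RoPE)."""
    d = x.shape[-1]
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    cos, sin = (pos * freqs).cos(), (pos * freqs).sin()
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)
s_global = rope(q, 10) @ rope(k, 7)  # positions without resetting at the boundary
s_reset = rope(q, 3) @ rope(k, 0)    # same pair after resetting to the doc start
print(torch.allclose(s_global, s_reset, atol=1e-4))  # True: only 10-7 == 3-0 matters
```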