Inconsistency in effective batch size reporting
In the model card you state that the model was trained with a world size of 1024 and a micro batch size of 1, but in the training hyperparameters section the effective batch size is listed as 4M tokens (2048x2048) rather than (1024x2048). Maybe there was a data entry error somewhere.
The effective batch size we're referring to is just the product of the global batch size and the sequence length, in this case 2048 x 2048 = 4,194,304 tokens. We're running sequence and tensor parallel with gradient accumulation, so the world size isn't the same as the data-parallel size, and the global batch size works out to 2048 sequences per optimizer step.
Oh I understand, I must have missed the mention of gradient accumulation. Thanks for clarifying! It might be helpful to include this in the table (e.g. gradient accumulation steps = 2).
I added a note to the training section about using GAS=16. Thanks for the feedback!
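For anyone checking the arithmetic later, here is a minimal sketch of how the figures fit together. The model-parallel degree below is not stated in the card; it is only the value implied by the other numbers and is included purely for illustration.

```python
# Sketch of how the batch-size figures relate in a Megatron-style setup.
# Only world size, micro batch size, global batch size and sequence length
# are taken from the model card; the model-parallel degree is assumed.

world_size = 1024          # total number of GPUs (from the model card)
micro_batch_size = 1       # per-GPU micro batch (from the model card)
global_batch_size = 2048   # sequences per optimizer step (from the model card)
seq_len = 2048             # tokens per sequence (from the model card)

model_parallel_degree = 8  # tensor/sequence (x pipeline) parallel degree -- assumed

# Ranks not consumed by model parallelism form the data-parallel replicas.
data_parallel_size = world_size // model_parallel_degree

# Gradient accumulation steps needed to reach the global batch size.
grad_accum_steps = global_batch_size // (data_parallel_size * micro_batch_size)

# Effective batch size in tokens = global batch size x sequence length.
effective_tokens = global_batch_size * seq_len

print(f"data-parallel size:       {data_parallel_size}")    # 128
print(f"grad accumulation steps:  {grad_accum_steps}")      # 16
print(f"effective batch (tokens): {effective_tokens:,}")    # 4,194,304
```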