Inconsistency in effective batch size reporting
In the model card you state that the model was trained with a world size of 1024 and a micro batch size of 1, but in the training hyperparameters section the effective batch size is listed as 4M tokens (2048x2048) rather than (1024x2048). Maybe there was a data entry error somewhere.
The effective batch size we're referring to is just the product of the global batch size and the sequence length, in this case 2048 x 2048 = 4,194,304 tokens. We're running sequence and tensor parallel with gradient accumulation, so the world size isn't the same as the data-parallel size, and the global batch size works out to 2048 sequences per optimizer step.
Oh I understand, I must have missed the mention of gradient accumulation. Thanks for clarifying! It might be helpful to include this in the table (e.g. gradient accumulation steps = 2).
I added a note to the training section about using GAS=16. Thanks for the feedback!
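For anyone checking the arithmetic later, here is a minimal sketch of how the figures fit together. The model-parallel degree below is not stated in the card; it is only the value implied by the other numbers and is included purely for illustration.

```python
# Sketch of how the batch-size figures relate in a Megatron-style setup.
# Only world size, micro batch size, global batch size and sequence length
# are taken from the model card; the model-parallel degree is assumed.

world_size = 1024          # total number of GPUs (from the model card)
micro_batch_size = 1       # per-GPU micro batch (from the model card)
global_batch_size = 2048   # sequences per optimizer step (from the model card)
seq_len = 2048             # tokens per sequence (from the model card)

model_parallel_degree = 8  # tensor/sequence (x pipeline) parallel degree -- assumed

# Ranks not consumed by model parallelism form the data-parallel replicas.
data_parallel_size = world_size // model_parallel_degree

# Gradient accumulation steps needed to reach the global batch size.
grad_accum_steps = global_batch_size // (data_parallel_size * micro_batch_size)

# Effective batch size in tokens = global batch size x sequence length.
effective_tokens = global_batch_size * seq_len

print(f"data-parallel size:       {data_parallel_size}")    # 128
print(f"grad accumulation steps:  {grad_accum_steps}")      # 16
print(f"effective batch (tokens): {effective_tokens:,}")    # 4,194,304
```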