GPU used for training

#2
by Locutusque - opened

Hello,
I was wondering if you guys could tell me what GPU was used to train this model.
Thanks,
Sebastian

BEEspoke Data org

Hi! Thanks for your interest. The training was completed in sequential 'phases' on a few different GPUs (partly to test whether batch size really mattered). It turns out that in this case it didn't really matter, as long as the functional batch size (with gradient checkpointing) is ~128. So most of it was done on a 3080 Ti!
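For anyone wanting to reproduce the ~128 functional batch size on a single consumer GPU, here is a minimal sketch of the arithmetic. It assumes gradient accumulation is used to reach the target, which the reply does not state (it only mentions gradient checkpointing). The function names and the micro-batch size of 8 are hypothetical, not the settings used for this model.

```python
import math


def accumulation_steps(target_batch: int, micro_batch: int, num_gpus: int = 1) -> int:
    """Smallest number of accumulation steps that reaches at least target_batch.

    Assumption: the functional batch is micro_batch * num_gpus * steps.
    """
    if target_batch <= 0 or micro_batch <= 0 or num_gpus <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_batch / (micro_batch * num_gpus))


def effective_batch(micro_batch: int, steps: int, num_gpus: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return micro_batch * steps * num_gpus


# Hypothetical example: a micro-batch of 8 sequences fits in memory on one GPU.
steps = accumulation_steps(target_batch=128, micro_batch=8)
print(steps, effective_batch(8, steps))
```

When the micro-batch size does not divide 128 evenly, the sketch rounds up, so the functional batch lands slightly above the target, which is consistent with "~128".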

Also, a big factor in making this possible was keeping the context length for pretraining at 1024 (because it can always be RoPE-scaled later if needed).
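To illustrate what RoPE scaling could look like, here is a rough sketch of linear position interpolation, one common scaling variant. The reply does not say which method would be used, so treat the choice of method, the function name `rope_angles`, the base of 10000, and the head dimension of 64 as assumptions.

```python
def rope_angles(position: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> list[float]:
    """Rotation angles for one token position under rotary embeddings.

    With scale > 1, positions are divided by the scale factor (linear
    interpolation), so longer sequences map back into the position range
    seen during pretraining.
    """
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# Hypothetical example: extend a 1024-token pretraining context to 2048 tokens.
scale = 2048 / 1024
# Position 2048 with scaling gets the same angles as position 1024 without it.
print(rope_angles(2048, 64, scale=scale) == rope_angles(1024, 64))
```

In practice, a short fine-tuning run at the longer length is usually needed after scaling like this.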
