Can you provide more details on the training?

#10
by dequ777 - opened

I tried to reproduce the results for the 700M BitNet b1.58 model, but I failed.
Instead of being S-shaped, my loss curve showed an exponential decay. The ppl of my final model was 18.7, whereas the paper reports 12.87, and the 700M model you provided achieves that as well.
I believe my training setup is exactly the same as in the paper, but I don't know exactly how the RedPajama-100B training set was generated. Could you share more details about the training and the dataset?
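
To put the gap in terms of training loss, here is a small sketch that converts the two perplexities into per-token cross-entropy. It assumes ppl is the exponential of the mean natural-log loss per token, and the helper name `ppl_to_loss` is only illustrative.

```python
import math


def ppl_to_loss(ppl: float) -> float:
    """Convert perplexity to mean cross-entropy in nats per token.

    Assumes ppl = exp(mean negative log-likelihood), natural log.
    """
    return math.log(ppl)


my_loss = ppl_to_loss(18.7)      # ppl of the reproduced model
paper_loss = ppl_to_loss(12.87)  # ppl reported in the paper
loss_gap = my_loss - paper_loss  # difference in nats per token

print(f"my loss:    {my_loss:.4f}")
print(f"paper loss: {paper_loss:.4f}")
print(f"gap:        {loss_gap:.4f} nats/token")
```

Under that assumption the difference is a bit under 0.4 nats per token, so it should be clearly visible when comparing eval loss curves.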

I have run into the same problem: my loss curve declined quickly when training a 1.1B model, and the MFU is very low.
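
For anyone comparing numbers, this is roughly how MFU can be estimated. It is a sketch using the common approximation of 6 FLOPs per parameter per token (forward plus backward, attention FLOPs ignored). The throughput, GPU count, and peak FLOPs below are hypothetical placeholders, not measurements from my run.

```python
def estimate_mfu(
    n_params: float,
    tokens_per_sec: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate model FLOPs utilization (MFU) as a fraction in [0, 1].

    Uses the 6 * N FLOPs-per-token approximation for a dense
    transformer; attention FLOPs are not counted.
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


# Hypothetical example values: replace with your own measurements.
mfu = estimate_mfu(
    n_params=1.1e9,             # 1.1B-parameter model
    tokens_per_sec=100_000,     # assumed cluster-wide throughput
    n_gpus=8,                   # assumed GPU count
    peak_flops_per_gpu=312e12,  # assumed peak dense BF16 FLOPs per GPU
)
print(f"estimated MFU: {mfu:.1%}")
```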
