Very interesting, as this time they DID update the codebase, so it is a new model!
Forget the training!!!
Most important are the codebase changes: the context extensions and sliding-window implementations, as well as the rotary and scaled embeddings. They have not added the ring embeddings yet!
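To make the rotary/scaled-embedding point concrete, here is a minimal sketch of rotary position embeddings with a simple linear position-scaling factor. It assumes the usual RoPE formulation; the function and argument names are illustrative only, not the exact API of the transformers library.

```python
# Minimal sketch of rotary position embeddings with a linear scaling factor.
# Names and the exact channel interleaving are illustrative, not the
# transformers library's implementation.
import torch

def rotary_embed(x, base=10000.0, scale=1.0):
    # x: (batch, seq_len, num_heads, head_dim) with an even head_dim
    seq_len, head_dim = x.shape[1], x.shape[-1]
    # one inverse frequency per pair of channels
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # positions are divided by `scale` to stretch the usable context window
    pos = torch.arange(seq_len).float() / scale
    freqs = torch.outer(pos, inv_freq)        # (seq_len, head_dim / 2)
    cos, sin = freqs.cos(), freqs.sin()
    cos = cos[None, :, None, :]               # broadcast over batch and heads
    sin = sin[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]       # split channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 16, 8, 64)                 # toy query tensor
q_rot = rotary_embed(q, scale=2.0)            # 2x linear position scaling
```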
Interesting again is that ALL of these models are essentially clones of the Llama codebase!!
So they all enjoy the same increased capabilities:
Mistral actually copied the Llama codebase 100% with no changes!!!
Obviously, check out the codebases in the transformers library!
But in general, Mistral 7B will still outperform them, as its NUMBERS (the config values) are correct!
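If you want to compare the numbers yourself, one quick way is to pull the configs from the Hugging Face Hub and print them side by side. This is only a sketch: it assumes network access, the transformers package, and example model ids (the Llama repo is gated and needs an access token).

```python
# Sketch: fetch two model configs and compare their hyperparameters.
# Model ids are examples; gated repos require an access token.
from transformers import AutoConfig

ids = ["mistralai/Mistral-7B-v0.1", "meta-llama/Meta-Llama-3-8B"]
configs = {i: AutoConfig.from_pretrained(i) for i in ids}

keys = ["hidden_size", "intermediate_size", "num_hidden_layers",
        "num_attention_heads", "num_key_value_heads",
        "max_position_embeddings", "rope_theta"]
for key in keys:
    print(key, {i: getattr(configs[i], key, None) for i in ids})
```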
Llama 3 and all these models are released with BAD numbers, with pure mismatches! (This is the trick when you want to release open-source models and NOT share the capabilities with the public: they know the right numbers and generate a model with them for themselves, so you would have to pretrain your own!)
Otherwise they would be releasing a commercially READY model!
The commercially ready models (guarded) are kept on the company hosts!!!
So go and generate a model with the correct values and you will have a good model! (Mistral also realized this and then released Nemo with a 5120 hidden size, which is a bomb to the model: 5120 does not follow any convention and is not a power of two, so it does not halve down cleanly to a standard bit or byte size!)
Hence all mathematical operations (training and tensor calcs) will be intensive and unnatural, breeding unnatural numbers for the model (hence bad performance).
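To make the hidden-size arithmetic concrete, here is a tiny check. 4096 is included only as a familiar power-of-two reference value; 5120 is the Nemo figure mentioned above. Both split evenly into 128-dim heads, but only 4096 is a power of two.

```python
# Quick arithmetic check of the hidden-size argument above.
# 4096 is a reference power of two; 5120 = 5 * 1024 is not a power of two,
# even though both divide evenly into 128-dim attention heads.
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

for hidden in (4096, 5120):
    print(hidden,
          "| power of two:", is_power_of_two(hidden),
          "| heads of 128:", hidden // 128,
          "| remainder:", hidden % 128)
# 4096 | power of two: True | heads of 128: 32 | remainder: 0
# 5120 | power of two: False | heads of 128: 40 | remainder: 0
```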
So pretraining is a waste!