Why the lower temperature?
Unlike previous Mistral models, Mistral Nemo requires smaller temperatures. We recommend using a temperature of 0.3.
Is it because the vocab size is much bigger now? Or is there a different reason?
It works better in our experiments!
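For reference, here is a minimal sketch of applying that recommendation with the transformers library. The model id, prompt, and token budget are assumptions for illustration, not something stated above:

```python
# Minimal sketch: sampling from Mistral Nemo at the recommended temperature.
# Model id, prompt, and max_new_tokens are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Instruct-2407"  # assumed HF model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain beam search in one paragraph.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.3,   # lower than the defaults typically used with earlier Mistral models
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```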
Usually, the more a model is trained, the lower the temperature it needs!
The more raw the model is (i.e. a base model where the loss on the data is still very high), the higher the temperature it will need to get a good response (perhaps 2-3, even 7).
Once the model is well trained (i.e. loss at 0.9 or below), you will need a much lower temperature to extract a good answer.
So the better trained the model, the lower the temperature!
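To make this concrete, here is a small illustration (the logits are made up, not from any real model) of how temperature reshapes the softmax over next-token candidates: low temperature concentrates probability on the strongest option, while high temperature lets the weaker alternatives through.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: lower T sharpens the distribution, higher T flattens it."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Hypothetical logits for four candidate tokens (illustrative only).
logits = [4.0, 2.5, 1.0, 0.5]

for t in (0.3, 1.0, 2.0):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# At T=0.3 nearly all probability sits on the top token;
# at T=2.0 the weaker alternatives get sampled far more often.
```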
When you give the model a high temperature, it will sample more of the optional responses that were trained at higher loss!
Hence ground truths should be trained to much lower loss (overfit)!
And general knowledge, which may even change over time, can be trained at a loss of 1, even 2 (underfit). It is harder to retrieve, but it is there!
So for a model specialized in a task, expect it to sit more towards the overfit end of the spectrum!
The questions to ask are: what prompt was used during training, which task was this model trained for, and which prompts will reveal the best from the model?
The prompts used most heavily in training will be the best ones for extracting this newly tuned data!
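One practical way to query with the format an instruct model was actually tuned on is the tokenizer's chat template. A minimal sketch, again assuming the Mistral Nemo model id; the question is just a placeholder:

```python
# Sketch: matching the prompt format the model was tuned on by using the
# tokenizer's chat template rather than a hand-rolled prompt string.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
messages = [{"role": "user", "content": "List three uses of temperature in sampling."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the exact template the instruct tuning expects
```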
When fine-tuning, it is important to keep track of your main prompt structures, as these are your internal functions (your knowledge and methods)!
Hence, if you are having problems with RAG, perhaps you need to train that specific prompt, and it will then yield responses: they were actually in the model, but they were disconnected from a prompt!
Corpus training teaches the model, but the data is disassociated from any prompt, so it will need further prompt tuning on the same data to reveal it.
I.e. there is no point in throwing the Bible into the model without Bible-associated tasks! You may never get a good result from that content otherwise. A sketch of what attaching a task prompt to corpus data could look like follows below.
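As a sketch of what "Bible-associated tasks" could mean in practice: pair each corpus passage with a task prompt before fine-tuning, so the knowledge stays connected to a prompt. The template, helper, and passage here are purely illustrative assumptions, not any particular library's API:

```python
# Illustrative sketch: turning raw corpus passages into prompt-associated
# training samples so the knowledge is reachable through a known prompt.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def make_sample(passage: str, question: str) -> str:
    """Pair a corpus passage with a task prompt so fine-tuning links the two."""
    return PROMPT_TEMPLATE.format(instruction=question, response=passage)

passage = "In the beginning God created the heavens and the earth."
sample = make_sample(passage, "Quote the opening verse of Genesis.")
print(sample)
```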