HumanEval and HumanEval-X score?

#2 opened by whatever1983

I don't know why people love the "French" Mistral 7B model so much. It is a European jack-of-all-trades, master-of-none approach, and the outputs are unnatural; I would definitely use 13B models instead of 7B. Look at Xwin 13B: almost GPT-4 level, and quantized to Q4_K_M it is only 7.9 GB. Even phones can spare 4 GB of model memory to upgrade to 13B.

Using Mistral 7B as the base (HumanEval ≈ 30), I wonder whether the 80K Evol-Instruct set would bring it above HumanEval = 40. Why not try Evol-Instruct on either Phi-1 (already at HumanEval 50.6) or TinyLlama 1.1B, to see if you can get HumanEval above 60 (which is WizardCoder-13B level)?

Xwin 13B? You mean Xwin-LM-13B?
What's the HumanEval score for the Mistral code model anyway?

@Pumba2 No HumanEval numbers at the moment.
I will work on it today, but I don't know how to run it yet.

@whatever1983 It's because Mistral is "smarter" than 13B models. Xwin 13B is GPT-4 level on AlpacaEval only. That's one eval, and it's a good eval, but on other evals Xwin scores lower.

The model has enough common sense to solve riddles that even 13B models have a hard time with. True, 13B Llama models are probably better for chat and RP, but Mistral is coherent and really strong overall, compared to other models that excel at one task but suck at another.

Phi-1.5 is garbage at anything that's not "textbook"-like, and TinyLlama will not magically reach 60 on HumanEval even with Evol-Instruct. It's good for its size but still much worse than Llama 7B.

The reason WizardCoder has such a high score is that it uses CodeLlama, not Llama. If you trained Mistral on the same amount of code, 99% it would be better than WizardCoder-13B.

I started HumanEval testing. "Base" below is pass@1 on the original HumanEval tests; "Base + Extra" adds the EvalPlus extended test cases (HumanEval+).

Nondzu mistral7b-code
Base: {'pass@1': 0.3353658536585366}
Base + Extra: {'pass@1': 0.2804878048780488}

For comparison, here is the original Mistral model tested on the same machine:

Mistral 7B
Base: {'pass@1': 0.2926829268292683}
Base + Extra: {'pass@1': 0.24390243902439024}

Generation command:

```
python codegen/generate.py --bs 1 --temperature 0 --dataset humaneval --model mistral-7b --root /root/kamil/workdir --resume --greedy
```
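If this is the EvalPlus harness (the codegen/generate.py script and the Base / Base + Extra output format suggest it is), the scoring step that produces the pass@1 dicts above would be its evaluate entry point. A hedged sketch, with the samples path as a placeholder:

```
# Assumed EvalPlus scoring step; the samples path is a placeholder for
# whatever codegen/generate.py wrote under --root. It prints pass@k for
# the base HumanEval tests and for base + extra (HumanEval+) tests.
evalplus.evaluate --dataset humaneval --samples <path-to-generated-samples>
```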

Next I will run the test with 200 samples per problem and a few temperature settings.
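For anyone reproducing this: with multiple samples per problem, pass@k is normally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021), then averaged over all 164 problems. A minimal sketch; the sample counts in the example are made up:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-draw contains a pass
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical counts: 200 samples for one problem, 57 of them pass.
print(pass_at_k(n=200, c=57, k=1))   # ~0.285, i.e. c / n when k = 1
print(pass_at_k(n=200, c=57, k=10))  # much higher: any of 10 draws may pass
```

With greedy decoding (one sample per problem, as in the temperature-0 run above), pass@1 reduces to the fraction of the 164 problems solved; e.g. the 0.3354 above corresponds to 55/164.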

