8x7B (Q3) vs 7B


Since both 8x7B (Q3) and 7B would fit in 24 GB of GPU RAM, which would be more accurate? What is an easy way to test?

Performance-wise, 8x7B (Q3) runs at 83 t/s and 7B at 129 t/s on an RTX 4090. As soon as we switch to 8x7B (Q4), it exceeds the 24 GB of GPU RAM and throughput drops to 27 t/s.
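For context, a back-of-envelope estimate of weight sizes shows why Q4 no longer fits; the parameter counts and effective bits/weight below are rough assumptions based on typical GGUF file sizes, not exact figures:

```python
# Rough weight-size estimates: why Q4 spills past a 24 GB card while Q3 fits.
GIB = 1024**3

models = {
    "Mixtral 8x7B Q3_K_M": (46.7e9, 3.5),   # ~46.7B params, ~3.5 bits/weight (assumed)
    "Mixtral 8x7B Q4_K_M": (46.7e9, 4.5),   # ~4.5 bits/weight (assumed)
    "Mistral 7B FP16":     (7.2e9, 16.0),
}

for name, (params, bits) in models.items():
    gib = params * bits / 8 / GIB
    print(f"{name}: ~{gib:.1f} GiB of weights (KV cache and activations extra)")
```

This gives roughly 19 GiB for Q3 versus roughly 24.5 GiB for Q4, and a 24 GB card has only about 22.4 GiB, so Q4 partially falls back to system RAM, which matches the drop to 27 t/s.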

@vidyamantra a bigger quantized model is always better than a smaller unquantized model,
so use the 8x7B Q3 if you want better quality

@YaTharThShaRma999 I don't think this is always true; we should run benchmarks!

@shroominic Here is a benchmark that may be useful to you: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md

If you want to run it locally, you can build llama.cpp and compute perplexity (PPL) scores on GGUF models.
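A minimal sketch of that workflow, assuming llama.cpp has already been built with `make` and the wikitext-2 test set has been downloaded (the model paths below are hypothetical placeholders):

```python
# Run llama.cpp's perplexity tool over two GGUF models and compare the
# reported PPL values (lower is better). Run this from the llama.cpp checkout.
import subprocess

MODELS = [
    "models/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf",  # hypothetical path
    "models/mistral-7b-instruct-v0.2.Q8_0.gguf",      # hypothetical path
]

for model in MODELS:
    # -ngl 99 offloads all layers to the GPU; -f points at the eval text.
    subprocess.run(
        ["./perplexity", "-m", model,
         "-f", "wikitext-2-raw/wiki.test.raw", "-ngl", "99"],
        check=True,
    )
```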

> @vidyamantra a bigger quantized model is always better than a smaller unquantized model, so use the 8x7B Q3 if you want better quality

@YaTharThShaRma999 That doesn't seem to be true for things like MemGPT. Perhaps it's not true for RAG in general?

A very simple (amateurish) test: given multiple-choice questions, the model was asked to produce suitably drilled-down subject tags. A total of 109 questions were used, and the results were compared against precomputed correct answers. Each answer was scored on how accurate the drilled-down tags were (a sketch of the scoring loop follows the table).

| Model | Average Score | Correct Tags |
|---|---|---|
| gpt-3.5-turbo | 35.65834862 | 82/109 |
| gpt-3.5-turbo-instruct | 32.0842605 | 72/109 |
| mixtral-8x7b-instruct-v0.1.Q5_K_M | 25.17394495 | 59/109 |
| mixtral-8x7b-instruct-v0.1.Q6_K | 23.64691589 | 60/109 |
| mixtral-8x7b-instruct-v0.1.Q8_K | 23.17743119 | 59/109 |
| mixtral-8x7b-instruct-v0.1.Q4_K_M | 23.06449541 | 60/109 |
| mistralai_Mistral-7B-Instruct-v0.1 | 20.99638889 | 51/109 |
| mixtral-8x7b-instruct-v0.1.Q3_K_M | 20.74944444 | 49/109 |
| Mistral-7B-Instruct-v0.2 | 18.59256881 | 59/109 |
| upstage_SOLAR-10.7B-Instruct-v1.0 | 15.88376147 | 52/109 |
| microsoft_phi-2 | 4.715196078 | 14/109 |
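A minimal sketch of such a scoring loop, assuming a local OpenAI-compatible endpoint (e.g. llama.cpp's server), a hypothetical `questions.json` format, and a toy `grade` function:

```python
# Score a model's subject tags against precomputed answers. The endpoint,
# file format, and grading rule are all assumptions -- adjust to your setup.
import json
from openai import OpenAI

# Local OpenAI-compatible server (llama.cpp server, etc.); key is unused.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

def grade(predicted: str, expected: str) -> float:
    """Toy scorer: fraction of expected tags present in the model output."""
    want = {t.strip().lower() for t in expected.split(",")}
    got = {t.strip().lower() for t in predicted.split(",")}
    return len(want & got) / len(want) if want else 0.0

# Hypothetical format: [{"question": "...", "tags": "algebra, quadratics"}, ...]
questions = json.load(open("questions.json"))

scores, correct = [], 0
for q in questions:
    reply = client.chat.completions.create(
        model="local",  # most local servers ignore the model name
        messages=[{"role": "user",
                   "content": f"Give drilled-down subject tags for:\n{q['question']}"}],
        temperature=0,
    ).choices[0].message.content
    s = grade(reply, q["tags"])
    scores.append(s)
    correct += s == 1.0

print(f"average score: {100 * sum(scores) / len(scores):.2f}")
print(f"fully correct: {correct}/{len(questions)}")
```

Swapping `grade` for a stricter comparison (exact tag-hierarchy match, or embedding similarity) would make the averages easier to interpret across models.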

I will try to write a better test. Any pointers?
