My alternative quantizations.

#1
by ZeroWw - opened

These are my own quantizations (updated almost daily).

The difference from normal quantizations is that I quantize the output and embedding tensors to f16,
and the remaining tensors to q5_k, q6_k, or q8_0.
This produces models that are barely degraded, or not degraded at all, while being smaller in size.
They run at about 3-6 t/s on CPU alone using llama.cpp,
and obviously faster on machines with capable GPUs.
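Assuming the files were made with llama.cpp's `llama-quantize` tool, the recipe above (output and embedding tensors at f16, everything else at q5_k/q6_k/q8_0) can be sketched roughly as follows; the flag names match recent llama.cpp builds, and the file names are placeholders:

```shell
# Sketch: quantize a GGUF model with llama.cpp's llama-quantize,
# keeping the output and token-embedding tensors at f16 while the
# remaining tensors use Q6_K (swap in Q5_K or Q8_0 as preferred).
# "model-f16.gguf" is a placeholder for your full-precision GGUF file.
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  model-f16.gguf model-q6_k-f16.gguf Q6_K
```

Check `./llama-quantize --help` in your llama.cpp checkout, since these tensor-type overrides were added relatively recently and option names may differ between versions.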

Salesforce org

Hi @ZeroWw , thanks for your contributions! We appreciate that!

BTW, we also have the Q3, Q5, Q6 versions ready, but didn't upload to the repos. If people like you find that these quantized versions are useful, we can also upload them to the official repo. Thanks!

I think that anything below q5_k degrades the models too much, but that's just my opinion. q6_k/q5_k seems to be the sweet spot (keeping output and embed at f16).