Why different sizes for same quants?

by supercharge19

Why and how is your model smaller than this one: https://huggingface.co/Felladrin/gguf-TinyMistral-248M-SFT-v4/tree/main

Is it more efficient than the other quants mentioned above? Does it trade accuracy for speed, or does it gain not only in speed but also in quality?

We are both quantizing the same TinyMistral-248M-SFT-v4 for sure. However, I can think of two possible reasons, even though I am not 100% sure:

  • I am using the latest llama.cpp, while the other GGUF is 2 months old. The project lands optimizations on a near-daily basis, so that could be one reason.
  • Second, some people quantize directly without an intermediate conversion. I have read that this is possible but not as accurate. I personally first convert the model to a 16-bit GGUF, then quantize all the other types from that, and remove the 16-bit file before uploading them all (see the sketch below). Maybe that also has something to do with the size.
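
For concreteness, here is a minimal sketch of that two-step workflow, assuming a local llama.cpp checkout with convert_hf_to_gguf.py and a built llama-quantize binary (older checkouts name these convert.py and quantize), and assuming the source model has already been downloaded to a local folder; all paths, file names, and quant types below are just example assumptions:

```python
# Sketch of the fp16-first GGUF workflow (paths/names are placeholder assumptions).
import subprocess

LLAMA_CPP = "./llama.cpp"                          # local llama.cpp checkout (assumed)
MODEL_DIR = "./TinyMistral-248M-SFT-v4"            # locally downloaded HF model (assumed)
F16_GGUF = "TinyMistral-248M-SFT-v4.fp16.gguf"     # intermediate 16-bit GGUF

# Step 1: convert the Hugging Face model once to a 16-bit GGUF.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", MODEL_DIR,
     "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# Step 2: quantize every target type from that same 16-bit file,
# so each quant starts from the full-precision conversion.
for qtype in ["Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"]:
    out_gguf = F16_GGUF.replace("fp16", qtype)
    subprocess.run(
        [f"{LLAMA_CPP}/llama-quantize", F16_GGUF, out_gguf, qtype],
        check=True,
    )
```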

But it is an interesting question!

Converting to fp16 first will certainly save you time. But do models produced this way have better perplexity, or the same? I suspect their quality will suffer, since quantization happens twice, though I have read that quantization sometimes even improves performance and lowers perplexity.
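
If anyone wants to settle the perplexity question empirically, llama.cpp ships a perplexity tool that can be run on both GGUFs over the same text file; a rough sketch, assuming a built llama-perplexity binary (older builds call it perplexity), a locally downloaded wikitext test file, and placeholder model file names:

```python
# Sketch: compare perplexity of two quants of the same model.
# Binary path and file names below are assumptions, not the exact files from this thread.
import subprocess

TEST_FILE = "wikitext-2-raw/wiki.test.raw"  # evaluation text, downloaded separately (assumed)
models = [
    "TinyMistral-248M-SFT-v4.Q4_K_M.gguf",        # quant made from an fp16 GGUF
    "older-TinyMistral-248M-SFT-v4.Q4_K_M.gguf",  # the other repo's quant (placeholder name)
]

for model_path in models:
    # The tool prints a running PPL estimate; lower perplexity is better.
    subprocess.run(
        ["./llama.cpp/llama-perplexity", "-m", model_path, "-f", TEST_FILE],
        check=True,
    )
```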

Meanwhile, could you look at the Locutusque/TinyMistral-248M-v2.5-Instruct model for quantization?

It's not quantized twice: the model is converted once to a 16-bit GGUF, and then all the other types are quantized from that 16-bit file, which preserves the best accuracy. (This is the approach recommended by the team behind llama.cpp, but converting to 16-bit first and quantizing from it can be expensive in resources, so some people skip it.)

Locutusque/TinyMistral-248M-v2.5-Instruct seems like an interesting merge. I can do it in an hour and share it.

@supercharge19 https://huggingface.co/MaziyarPanahi/TinyMistral-248M-v2.5-Instruct-GGUF

Found it, thanks. But the files have a different extension (ggufm); how do I run them? (I am using llama.cpp.)

I found the issue: I just named the files wrongly! Those are GGUF models; I added an m at the end by mistake. I'll fix them.
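
For reference, once the extension is .gguf the files load like any other GGUF; a minimal run sketch, assuming a built llama-cli binary (older llama.cpp builds call it main) and an example file name and prompt:

```python
# Sketch: run a downloaded GGUF with llama.cpp's CLI.
# Binary path, model file name, and prompt are example assumptions.
import subprocess

subprocess.run(
    ["./llama.cpp/llama-cli",
     "-m", "TinyMistral-248M-v2.5-Instruct.Q4_K_M.gguf",  # renamed from .ggufm to .gguf
     "-p", "Write one sentence about quantization.",      # example prompt
     "-n", "128"],                                        # limit generated tokens
    check=True,
)
```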
