Why different sizes for same quants?

by supercharge19

Why and how is your model smaller than this one: https://huggingface.co/Felladrin/gguf-TinyMistral-248M-SFT-v4/tree/main

Is it more efficient than the other quants mentioned above? Does it trade accuracy for speed, or does it gain not only in speed but also in quality?

We are both quantizing the same TinyMistral-248M-SFT-v4 for sure. However, I can think of two possible reasons, even though I am not 100% sure:

  • I am using the latest llama.cpp, while the other GGUF is 2 months old. The project lands optimizations on a near-daily basis, so that could be one reason.
  • Second, some people quantize directly without an intermediate conversion. I have read that this is possible but not as accurate. I personally first convert the model to a 16-bit GGUF, then quantize all the other types from that, and remove the 16-bit file before uploading them all (see the sketch below). Maybe that also has something to do with the size.
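
For concreteness, here is a minimal sketch of that two-step workflow, assuming a local llama.cpp checkout with convert_hf_to_gguf.py and a built llama-quantize binary (older checkouts name these convert.py and quantize), and assuming the source model has already been downloaded to a local folder; all paths, file names, and quant types below are just example assumptions:

```python
# Sketch of the fp16-first GGUF workflow (paths/names are placeholder assumptions).
import subprocess

LLAMA_CPP = "./llama.cpp"                          # local llama.cpp checkout (assumed)
MODEL_DIR = "./TinyMistral-248M-SFT-v4"            # locally downloaded HF model (assumed)
F16_GGUF = "TinyMistral-248M-SFT-v4.fp16.gguf"     # intermediate 16-bit GGUF

# Step 1: convert the Hugging Face model once to a 16-bit GGUF.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", MODEL_DIR,
     "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# Step 2: quantize every target type from that same 16-bit file,
# so each quant starts from the full-precision conversion.
for qtype in ["Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"]:
    out_gguf = F16_GGUF.replace("fp16", qtype)
    subprocess.run(
        [f"{LLAMA_CPP}/llama-quantize", F16_GGUF, out_gguf, qtype],
        check=True,
    )
```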

But it is an interesting question!

Converting to fp16 first will certainly save you time. But do models produced this way have better perplexity, or the same? I suspect their quality will suffer, since quantization happens twice, though I have read that quantization sometimes even improves performance and lowers perplexity.
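
If anyone wants to settle the perplexity question empirically, llama.cpp ships a perplexity tool that can be run on both GGUFs over the same text file; a rough sketch, assuming a built llama-perplexity binary (older builds call it perplexity), a locally downloaded wikitext test file, and placeholder model file names:

```python
# Sketch: compare perplexity of two quants of the same model.
# Binary path and file names below are assumptions, not the exact files from this thread.
import subprocess

TEST_FILE = "wikitext-2-raw/wiki.test.raw"  # evaluation text, downloaded separately (assumed)
models = [
    "TinyMistral-248M-SFT-v4.Q4_K_M.gguf",        # quant made from an fp16 GGUF
    "older-TinyMistral-248M-SFT-v4.Q4_K_M.gguf",  # the other repo's quant (placeholder name)
]

for model_path in models:
    # The tool prints a running PPL estimate; lower perplexity is better.
    subprocess.run(
        ["./llama.cpp/llama-perplexity", "-m", model_path, "-f", TEST_FILE],
        check=True,
    )
```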

Meanwhile, could you look at the Locutusque/TinyMistral-248M-v2.5-Instruct model for quantization?

It's not quantized twice: the model is converted once to a 16-bit GGUF, and then all the other types are quantized from that 16-bit file, which preserves the best accuracy. (This is the approach recommended by the team behind llama.cpp, but converting to 16-bit first and quantizing from it can be expensive in resources, so some people skip it.)

Locutusque/TinyMistral-248M-v2.5-Instruct seems like an interesting merge. I can do it in an hour and share it.

@supercharge19 https://huggingface.co/MaziyarPanahi/TinyMistral-248M-v2.5-Instruct-GGUF

Found it, thanks. But the files have a different extension (ggufm); how do I run them? (I am using llama.cpp.)

I found the issue: I just named the files wrongly! Those are GGUF models; I added an m at the end by mistake. I'll fix them.
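
For reference, once the extension is .gguf the files load like any other GGUF; a minimal run sketch, assuming a built llama-cli binary (older llama.cpp builds call it main) and an example file name and prompt:

```python
# Sketch: run a downloaded GGUF with llama.cpp's CLI.
# Binary path, model file name, and prompt are example assumptions.
import subprocess

subprocess.run(
    ["./llama.cpp/llama-cli",
     "-m", "TinyMistral-248M-v2.5-Instruct.Q4_K_M.gguf",  # renamed from .ggufm to .gguf
     "-p", "Write one sentence about quantization.",      # example prompt
     "-n", "128"],                                        # limit generated tokens
    check=True,
)
```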
