Remove extra degrees of freedom by dequantizing the `q5_K_M`, `q4_K_M` and `q2_K` models together?

#18
by jukofyork - opened

Was this dequantized from just the q5_K_M model?

If so, then it should (in theory) be possible to improve the dequantization by using the q5_K_M, q4_K_M and q2_K models together. They likely each contain information about the original fp16 model they came from and should help remove extra degrees of freedom for the dequantization process...

I don't really know the maths used for the quantization, but I guess the dequantization was something along the lines of using a Moore-Penrose pseudoinverse of some kind of "standard" GGUF quantization matrix of dimension (2^16 x 2^5)? If so, then it should be possible to combine all 3 of the (2^16 x 2^5), (2^16 x 2^4) and (2^16 x 2^2) pseudoinverses, and the optimal least-squares weighting factors could be found empirically by dequantizing known fp16 models from their q5_K_M, q4_K_M and q2_K variants.

It would probably require writing some custom code, but if this model is as good as it seems to be then it would probably be worthwhile!?

Owner

Yeah, I don't see why this wouldn't work either.


Can you also dequantize and upload the q4_K_M and q2_K models using the same method?

I've been thinking about this some more over the last few days and I think there is actually a really easy way to do this:

If we know the quantization criteria used (least squares, equal numbers in each bin, etc.) then it should be quite easy to estimate the mean and standard deviation of the quantization errors for 2^5, 2^4 and 2^2 bins respectively. These should have mean ~0 and standard deviations: 2^2 >> 2^4 > 2^5. Then for every triplet of quantized weights we can sit the estimated normal on top and calculate the maximum-a-posteriori estimate of the mean for the fp16 weight (see the "Posterior hyperparameters" column: https://en.wikipedia.org/wiki/Conjugate_prior).
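To make that concrete, here's a minimal sketch of the Gaussian-error version (the error standard deviations below are made up for illustration; in practice you'd estimate them from the bin widths of each quant):

```python
import numpy as np

# Hypothetical error standard deviations for each quant level; in practice
# you'd estimate them from the bin widths (q2 >> q4 > q5).
sigma = {"q5": 0.01, "q4": 0.02, "q2": 0.08}

def map_combine(deq, sigma, prior_var=None):
    """Precision-weighted combination of several dequantized copies of the
    same tensor; this is the MAP/posterior mean under Gaussian errors."""
    precisions = {k: 1.0 / sigma[k] ** 2 for k in deq}
    num = sum(p * deq[k] for k, p in precisions.items())
    den = sum(precisions.values())
    if prior_var is not None:  # optional zero-mean Gaussian prior on the weights
        den += 1.0 / prior_var
    return num / den

# Toy demo with synthetic tensors standing in for the real dequantized weights.
rng = np.random.default_rng(0)
true_w = rng.normal(0.0, 0.02, size=4096)
deq = {k: true_w + rng.normal(0.0, s, size=true_w.shape) for k, s in sigma.items()}
combined = map_combine(deq, sigma)
for k in deq:
    print(k, "RMSE:", np.sqrt(np.mean((deq[k] - true_w) ** 2)))
print("combined RMSE:", np.sqrt(np.mean((combined - true_w) ** 2)))
```

Under these assumptions the combined RMSE comes out lower than the q5 copy alone, and the per-model weights (each precision divided by the total precision) are exactly the fixed linear weighting factors mentioned below, so they could just be fed to a linear merge.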

BUT: If we just assume each bin is a uniform distribution created so as to have equal numbers of values fall into it, you don't actually even need to do the above and can calculate the correct weighting factor for each of the 3 different quantized models and then use Mergekit (https://github.com/arcee-ai/mergekit) to create a linear combination.

This can be refined to assume the original fp16 weights were normally distributed (ie: each bin covers a trapezium-shaped approximation or an exact section of the normal PDF) and to handle different quantization criteria, but I suspect this won't actually make that much difference.


@jukofyork @152334H Hey, just stumbled upon here. I know the last comments were from mid-February, but do you know if this idea of using the different quants files to do a more accurate dequant went anywhere or if anyone tried the ideas?

I was involved in a similar discussion but with the Grok model by x.com. They released an 8-bit quant, which got dequanted to f16 by various authors and then quanted back to .gguf quants. Because of the way the quanting is implemented in llama.cpp, that process loses some precision compared to the original Grok 8-bit weights even if you go with the Q8 gguf quant, even though the Q8 gguf quant scheme is more expressive than the original 8-bit scheme the Grok team used.

I'm interested in maybe trying to run with this idea and write some custom-tailored quant/dequant code, and stumbled here while trying to find people or discussions to see if anyone has tried similar ideas.

If we know the quantization criteria used (least squares, equal numbers in each bin, etc.) then it should be quite easy to estimate the mean and standard deviation of the quantization errors for 2^5, 2^4 and 2^2 bins respectively.

I don't know about the smaller quant sizes, but Q8 llama.cpp quantization is pretty simple (https://github.com/ggerganov/llama.cpp/blob/e562b9714b9b3e242361a7f74bbbeb00f6bd99ac/ggml-quants.c#L727-L750): compute the absolute max, store the scale as f16, use a bucket size of 32, and encode the weights as 8-bit values from -128 to 127.
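In rough NumPy pseudo-code, my reading of that scheme is something like this (just the idea, not a drop-in re-implementation of the linked C):

```python
import numpy as np

BLOCK = 32  # Q8_0 block size in the linked code

def q8_0_like_quantize(x):
    """Per-block absmax quantization: one f16 scale per block, int8 values."""
    xb = x.reshape(-1, BLOCK).astype(np.float32)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    d = amax / 127.0                                    # scale, stored as f16
    safe_d = np.maximum(d, np.finfo(np.float32).tiny)   # avoid divide-by-zero
    q = np.clip(np.round(xb / safe_d), -128, 127).astype(np.int8)
    return q, d.astype(np.float16)

def q8_0_like_dequantize(q, d):
    return q.astype(np.float32) * d.astype(np.float32)

x = np.random.default_rng(1).normal(0.0, 0.02, 1024).astype(np.float32)
q, d = q8_0_like_quantize(x)
err = q8_0_like_dequantize(q, d) - x.reshape(-1, BLOCK)
print("max abs reconstruction error:", np.abs(err).max())
```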

I haven't studied it but e.g. Q5_K https://github.com/ggerganov/llama.cpp/blob/e562b9714b9b3e242361a7f74bbbeb00f6bd99ac/ggml-quants.c#L2537 seems a little bit more sophisticated. I would imagine the original team that made the models wouldn't have bothered to write custom quanting code and just used what was in llama.cpp at the time.

Hi @Noeda ,

I never looked any further into this, as I saw that under some distributional assumptions the lower quants don't actually give any extra information:

If the weights were sampled from a uniform distribution then, whatever quantization criteria you use, the bin boundaries of the different power-of-2 bin counts will approximately coincide (each q4 bin is just the union of two q5 bins, and so on).

This would then mean that knowing which of the 32 bins a q5 weight fell into tells you exactly which of the q4 and q2 bins it is in already, and there is no extra information to extract from knowing the q4 or q2.

In reality the weights are likely to come from a Gaussian distribution, and different quantization criteria will also affect the bin boundaries.

I think a quick and dirty test to see if this is worth pursuing would be to dequantize both the q5 and q4 and then see how closely the q4 bin centres are approximated by the means of consecutive pairs of q5 bin centres.
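Something along these lines, here done entirely on synthetic Gaussian weights with an "equal numbers in each bin" quantizer standing in for the real dequantized tensors:

```python
import numpy as np

# Simulated version of the check: equal-probability ("equal numbers in each
# bin") quantizers with 2^5 and 2^4 levels over synthetic Gaussian weights,
# then compare the q4 bin centres to the means of consecutive q5 centre pairs.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000)   # stand-in for the fp16 weights

def bin_centres(x, n_bins):
    # Quantile bin edges; the per-bin mean plays the role of the bin centre.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return np.array([x[idx == b].mean() for b in range(n_bins)])

c5 = bin_centres(w, 32)
c4 = bin_centres(w, 16)
paired = c5.reshape(16, 2).mean(axis=1)     # means of consecutive q5 centre pairs
print("max |paired q5 - q4|:", np.abs(paired - c4).max())
print("typical q4 bin spacing:", np.median(np.diff(c4)))
```

If the gap is tiny compared to the bin spacing (as it is here, because these bins nest), then the q4 quant is telling you nothing the q5 didn't already.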

I was involved in a similar discussion but with the Grok model by x.com. They released an 8-bit quant, which got dequanted to f16 by various authors and then quanted back to .gguf quants. Because of the way the quanting is implemented in llama.cpp, that process loses some precision compared to the original Grok 8-bit weights even if you go with the Q8 gguf quant, even though the Q8 gguf quant scheme is more expressive than the original 8-bit scheme the Grok team used.

I don't know about the smaller quant sizes, but Q8 llama.cpp quantization is pretty simple (https://github.com/ggerganov/llama.cpp/blob/e562b9714b9b3e242361a7f74bbbeb00f6bd99ac/ggml-quants.c#L727-L750): compute the absolute max, store the scale as f16, use a bucket size of 32, and encode the weights as 8-bit values from -128 to 127.

How does Grok store its quantized values? If it's the same offset and scale but with different bucket sizes then you might be able to keep the 8 bit data as is, and just reoptimize the max and scale values?

(I thought I wrote a comment here this morning but when I checked it's gone. Did I forget to click "Comment"? Ugh. Okay well let me type it again 😅)

Thanks for the quick answer @jukofyork!

How does Grok store its quantized values? If it's the same offset and scale but with different bucket sizes then you might be able to keep the 8 bit data as is, and just reoptimize the max and scale values?

Grok has this class in its original Jax code (roughly, going off memory):

class Quantized8bitWeights:
    weight: [int8]
    scale: [f16]

Where len(weight) / len(scale) IIRC was typically 786 or 1024. So "block size" (they don't mention this term in their code) is fairly large (GGUF Q8 has a block size of 32). Shape is included in the weight. At inference time, they do weight * scale to convert it to higher precision for computing.

It's the same scheme as Q8 GGUF, just with a different block size. Your intuition is correct (assuming my intuition is also correct); if you keep the 8-bit data as-is, it'll be exactly the same if you go Grok Q8 -> GGUF Q8. It's just that in practice people dequant to f16 and then go back to Q8 GGUF, and the current algorithm as written in the llama.cpp code will not encode it in such a way that you get the same original weights back as you would with the original Grok Q8 code. It doesn't seem like some massive precision loss, and IMO not worth spending time on for a single model that doesn't seem that good; but you do get a bit of loss. A few days earlier I did a quick and dirty, not-very-scientific test of this using synthetic data simulating the Grok Q8 -> F16 -> GGUF Q8 -> F16 conditions rather than real weight data, so I sort of confirmed it experimentally. But if someone bothered to write the custom code, that would fix it.
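For what it's worth, the synthetic test was roughly along these lines (purely simulated absmax quantizers with a big block on the "Grok" side and block 32 on the "GGUF Q8" side, not the real code paths):

```python
import numpy as np

def absmax_q8(x, block):
    """Symmetric int8 absmax quantization with one f16 scale per `block` values."""
    xb = x.reshape(-1, block).astype(np.float32)
    d = (np.abs(xb).max(axis=1, keepdims=True) / 127.0).astype(np.float16)
    safe = np.maximum(d.astype(np.float32), np.finfo(np.float32).tiny)
    q = np.clip(np.round(xb / safe), -128, 127).astype(np.int8)
    return q, d

def dequant(q, d):
    return (q.astype(np.float32) * d.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(2)
w = rng.normal(0.0, 0.02, size=1 << 20).astype(np.float32)

grok_f16 = dequant(*absmax_q8(w, 1024))        # simulated "released 8-bit weights", decoded to f16
roundtrip = dequant(*absmax_q8(grok_f16, 32))  # "Grok Q8" -> f16 -> "GGUF Q8" -> f16
print("RMSE added by re-quantizing:", np.sqrt(np.mean((roundtrip - grok_f16) ** 2)))
```

The printed RMSE is small but nonzero, and at least in this toy setup, keeping the int8 data as-is and just copying the parent block's scale into each 32-value block would make it exactly zero, which is basically the "reoptimize only the scales" idea.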

I lost my motivation to work on anything Grok after I tried the full precision version of it on a big 10xA40 machine and thought it was crappy even considering it as a non-finetuned base model, so I'll let that ghost just lie.

But then I stumbled here and got curious about the question you were posing, in my head: "is it possible to combine information from Q2, Q4 and Q5 to make a more accurate dequant, or even improve the Q2, Q4 and Q5?"

I never looked any further into this, as I saw that under some distributional assumptions the lower quants don't actually give any extra information:

That's a bummer. I was myself thinking of trying to exploit the particular way the llama.cpp code makes its various quants and any quirks in its scheme, so coming from a bit more of a mechanical standpoint than a probability theory standpoint. Some ideas, e.g. it seems like it's possible to give a strict lower and upper bound on the true value of a weight, individually for each of the Q2, Q4 and Q5 models, for some weights 100% exactly, and possibly the intersection of those ranges can tell you more about the true weight. And maybe that's a help. Or not. No idea. (That was one question I was trying to Google and find discussions about, to see if someone had studied this already, and thus engaged here.)

I still haven't studied the GGUF quants other than the simple Q8 but they don't seem that different in spirit. Probably a project for me this evening to check. Also as I was writing this comment, I noticed Databricks released some new big model that I want to check out too.

My current intuition would be that there might be something to do to make it a little bit more accurate using information from all of Q2+Q4+Q5, but probably nothing that would actually materially matter in terms of model output quality. If you already thought about this from the probability mathematics perspective and came to the conclusion that it's probably nothing good, then that does not bode well.

I do find it a curious question in general though. I might try something here but not sure. I started this branch for llama.cpp today and wrote a very dumb and hacky Python script to prepare for weird merges that don't involve a materialized f16/f32 dequant midway: https://github.com/Noeda/llama.cpp/tree/octopus-quant-merge-experiments (the code doesn't actually do anything, just made the skeleton to get all the quants and stuff in nice Python+Numpy format for quick experimenting rather than having to mess with C++ code). I don't believe in big improvements in dequant quality but I might engage in this for curiosity's sake. If I do something on this topic, I'll ping here in this thread if I think I have anything worth showing off.

Thanks again @jukofyork for a quick response. If you are interested in this topic and happen to have ideas or thoughts I'm open to discuss it further.

Hi @Noeda ,

It's definitely not certain that they can't be combined, but to better explain why I think it's less likely, consider this simple example:

If I divide the number line into 4 ranges:

1 : [0, 0.25]
2 : [0.25, 0.5]
3 : [0.5, 0.75]
4 : [0.75, 1]

and again into 2 ranges:

A : [0, 0.5]
B : [0.5, 1]

and now I generate a uniform random number in the range [0, 1] and tell you the number lies in range 2.

It makes no difference if I also tell you it lies in range A, as that gives no extra information.

The same is true for a uniform distribution and all power-of-2 divisions like this (where q5 = 2^5 divisions and q4 = 2^4 divisions).


Whereas if I had divided into 3 like so:

A : [0, 1/3]
B : [1/3, 2/3]
C : [2/3, 1]

Now if I tell you the unknown uniform random number lies in range 1 and range A, we again don't get any extra information from knowing about range A.

But if I tell you it lies in range 2 and range A, we can deduce it must lie in the range [0.25, 1/3], and we do get extra information from knowledge of A!
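The same point in a couple of lines of code, just intersecting the intervals from the example:

```python
def intersect(a, b):
    # Intersection of two closed intervals given as (low, high) tuples.
    return (max(a[0], b[0]), min(a[1], b[1]))

print(intersect((0.25, 0.5), (0.0, 0.5)))   # power-of-2 case: (0.25, 0.5), unchanged
print(intersect((0.25, 0.5), (0.0, 1/3)))   # thirds case: (0.25, 0.333...), narrower
```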


BUT: In reality the numbers are likely to be drawn from a Normal (bell-shaped) distribution, and there are several different possible error criteria used to quantize the values (eg: minimise the sum of squared errors from the bin centre, minimise the maximum absolute error over the whole bin range, place equal numbers of values in each bin, and so on). In this case the simple example above likely isn't appropriate, as the different bins will have different concentrations of probability mass, and you have to use Bayes' Theorem and integration to work out the most likely bin centres (or do it empirically using Monte Carlo simulation, to answer the question "given we fell into bin n and bin m, what was the most likely value we quantized, given that we followed process x to generate the bins").
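A rough sketch of what the Monte Carlo version could look like, entirely simulated under an assumed Normal weight distribution and two toy quantizers with different criteria (I've used the conditional mean, i.e. the least-squares answer, rather than the literal mode):

```python
import numpy as np

# Monte Carlo version of "given we fell into bin n (q5) and bin m (q4), what
# was the original value likely to be?", all simulated.
rng = np.random.default_rng(3)
w = rng.normal(0.0, 0.02, size=2_000_000)

def equal_width_bins(x, n):        # "minimise the maximum absolute error" style
    edges = np.linspace(x.min(), x.max(), n + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n - 1)

def equal_prob_bins(x, n):         # "equal numbers of values in each bin" style
    edges = np.quantile(x, np.linspace(0.0, 1.0, n + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, n - 1)

b5 = equal_width_bins(w, 32)       # pretend the q5 quant used one criterion...
b4 = equal_prob_bins(w, 16)        # ...and the q4 quant used another

# Conditional mean of the original value for every observed (q5 bin, q4 bin)
# pair: the least-squares dequant value for a weight seen in bins (n, m).
cell = b5 * 16 + b4
sums = np.bincount(cell, weights=w, minlength=32 * 16)
counts = np.bincount(cell, minlength=32 * 16)
cond_mean = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
print("populated (n, m) cells:", int((counts > 0).sum()), "(32 would mean q4 adds nothing)")
```

If the two quantizers nested perfectly you'd see exactly 32 populated cells; any more than that means the q4 bin occasionally does refine the q5 bin.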

Probably the simplest hacky method, of just trying different weighted convex combinations of the dequantized values from q5 and q4 (and possibly even q2) and optimising against the llama.cpp perplexity EXE, would be a good start.

Then if this shows promise possibly try to rip out the quantization algorithm used and use brute force with Monte Carlo simulation.

Any mathematical method of doing this will likely have to make lots of (unrealistic) assumptions, so Monte Carlo simulation is likely the best bet anyway. If the sample sizes were tiny then a mathematical approach might have more appeal, but in this case it doesn't make much sense.

Gotcheronis. Yeah, your reasoning with the 2^N bucket sizes makes sense. The intersections would be useless. I was reading the Q2, Q4 and Q5 code in llama.cpp yesterday and all of them use the same bucket size, and it's all powers of 2. However, they do have some other differences: the scale computation looks a bit different between Q2 and Q5, but I need to lay out their code for myself in a bit more simplified pseudo-code form to see if there's anything exploitable there or if it's just superficial differences. There's also a lot of bit fiddling in there that makes it a bit hard to intuitively understand exactly what they are doing. They have a concept of a "super block" and then smaller "blocks" inside the super blocks, which is more sophisticated than the GGUF Q8 that I studied earlier.

I worked a bit more yesterday on the llama.cpp fork with my Python script and got it into a state where I can make a new .gguf from the existing quants (no intermediate materialized dequant step, unless you explicitly do that at run-time in the script), and also a loop where it's easy to plug in a "training" scheme that gets Numpy access to all the tensors in the quants and a "ground truth" to target (in this case I'm using the llama2 model as the testbed because the original weights are available and it has the same architecture, so I have a full-precision llama2 .gguf for ground truth and created smaller llama2 quants to test with). All the source tensors are mmapped so it starts cooking quite fast too; I have a 192GB machine so all the models fit in my disk cache nicely.

One missing piece is perplexity testing; I haven't tried the llama.cpp perplexity software, but I see it routinely used for testing in PRs so I just need to get that set up. Computing the MSE dequant error is easy within the script, but maybe that is less correlated with perplexity than we'd like.

Probably the simplest hacky method, of just trying different weighted convex combinations of the dequantized values from q5 and q4 (and possibly even q2) and optimising against the llama.cpp perplexity EXE, would be a good start.

The first idea I wanted to try is, I think, fairly similar to what you are suggesting; a very dumb and simple linear combination of the dequantized values from the different quants, optimized for minimum MSE against the ground truth, and then checking whether there is any measurable improvement over the Q5 quant. That would be fast to optimize and fast to generate the `.gguf`, and I know how to code it quickly. So it would serve to validate whether there's any hope at all or I need to type RIP.
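The fit itself is just a tiny least-squares problem per tensor; something like this, where the tensor loading is left out and the names are placeholders:

```python
import numpy as np

def fit_combo_weights(deq_q5, deq_q4, deq_q2, ground_truth):
    """Least-squares weights w so that w[0]*q5 + w[1]*q4 + w[2]*q2 ~= ground truth.
    Inputs are flat float arrays for one tensor; the fit is unconstrained, so
    [1, 0, 0] (i.e. "just use q5") is always a candidate solution."""
    A = np.stack([deq_q5, deq_q4, deq_q2], axis=1).astype(np.float64)
    y = ground_truth.astype(np.float64)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse_q5 = float(np.mean((deq_q5 - ground_truth) ** 2))
    mse_mix = float(np.mean((A @ w - y) ** 2))
    return w, mse_q5, mse_mix

# Tiny synthetic smoke test (real use would feed in the dequantized GGUF tensors).
rng = np.random.default_rng(4)
gt = rng.normal(0.0, 0.02, 8192)
noisy = [gt + rng.normal(0.0, s, gt.shape) for s in (0.01, 0.02, 0.08)]
print(fit_combo_weights(*noisy, gt))
```

The in-sample MSE of the mix can't be worse than q5 alone (since [1, 0, 0] is always available to the fit), so the real question is whether the gap is big enough to show up in the perplexity numbers.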

And then hope really hard I didn't just waste a lot of time 😅 Well, even if this doesn't work I'll have my script for future work, e.g. if something is released with a situation somewhat similar to the Grok team's model and I feel like I want more accurate quants, I'll have a script I'm familiar with and can hack on quickly.

Also I wanted to draw graphs. Error graphs. Study distributions.
