mradermacher/model_requests · Request to bring back Q4_1

yttria

Sep 21

According to my tests, Q4_1 is the most efficient for the quality, using 20% less energy on my computer than Q4_K_M.

mradermacher

Owner Sep 21

•

edited Sep 21

I can believe the energy reduction, but how did you compare the quality, and how does it compare to the Q4_0 that I provide? I'd need a compelling reason to replace Q4_0 with Q4_1, and an even more compelling reason to provide both. The idea behind providing Q4_0 was to provide a fast quant for slow computers. I suspect Q4_1 is simply slower but also a bit higher quality. It certainly is a lot bigger with very little quality increase.

In any case, @nicoboss is currently preparing a quite extensive (quality) benchmark over all quant types (including Q4_1). I will reevaluate all quants based on that as well.

mradermacher

Owner Sep 21

@nicoboss while we are at it, another test I'd like to do on all quants is a speed test, to see how different quants work on cpu vs. gpu, possibly for both prompt processing and inferencing. I don't think I can put all this on the model page, but I could link to a guidance page, where people could get an idea of relative speeds vs. quality.

nicoboss

Sep 21

•

edited Sep 21

While I have never run dedicated performance benchmarks as part of the quant quality measurement project so far, I did run multiple evaluation benchmarks on each model, from which I measured the quant quality. Thanks to the creation and last-modified times of the evaluation result files, I was able to extract the following quant performance measurements. Each measurement covers running all the evaluation tests inside the benchmark. All tests were performed with the model stored fully in RAM, without offloading any layers to the GPU, but using a GPU to improve prompt evaluation (-ngl 0). Also keep in mind that these are multiple-choice benchmarks, so the results are skewed more heavily towards prompt evaluation than token generation than would be the case in most normal use cases.
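As a rough sketch of how a run time can be recovered from a result file's timestamps: assuming the file is created when a benchmark starts and last written when it ends, the difference is the wall-clock duration. The helper names are hypothetical and this is not the script actually used here. Creation time is not exposed on every platform (`st_birthtime` exists on macOS, but `os.stat` on Linux generally does not provide it), so a start timestamp can be passed in explicitly instead.

```python
import os
from typing import Optional


def duration_seconds(start_ts: float, end_ts: float) -> float:
    """Wall-clock duration between two Unix timestamps."""
    if end_ts < start_ts:
        raise ValueError("end timestamp precedes start timestamp")
    return end_ts - start_ts


def result_file_duration(path: str, start_ts: Optional[float] = None) -> float:
    """Estimate a benchmark's run time from its result file (hypothetical helper).

    Assumes the file was created at the start of the run and last modified
    at the end of it.
    """
    st = os.stat(path)
    if start_ts is None:
        # st_birthtime is only available on some platforms.
        start_ts = getattr(st, "st_birthtime", None)
        if start_ts is None:
            raise RuntimeError("creation time unavailable; pass start_ts explicitly")
    return duration_seconds(start_ts, st.st_mtime)
```

Timings obtained this way include model loading and everything else the run did, so they are only a coarse proxy for inference speed.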

Comparison between Q4_0 and Q4_1

Model	MMLU [s]	ARC easy [s]	ARC challenge [s]	Winogrande [s]
Fook-Yi-34B-32K-v1.Q4_0	329	245	38	79
Fook-Yi-34B-32K-v1.Q4_1	368	274	42	87
Fook-Yi-34B-32K-v1.i1-Q4_0	331	270	39	80
Fook-Yi-34B-32K-v1.i1-Q4_1	360	301	42	87

Conclusion

It seems wrong that Q4_1 performs better than Q4_0. I actually saw the exact opposite: Q4_0 seemed to perform quite a lot better than Q4_1. I monitored the GPU energy consumption during those benchmarks and it was approximately the same no matter the quant, so at least in these tests, a longer run time directly results in more energy consumption for the same task. I do not recommend replacing Q4_0 with Q4_1 from a performance perspective. Keep in mind that I have not considered quality in this comparison.
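To put a number on "quite a lot better", here is a small sketch that computes how much slower Q4_1 is than Q4_0 per benchmark, using only the times from the comparison table. The function name is made up for illustration.

```python
# Times in seconds, copied from the Q4_0 / Q4_1 comparison table.
TIMES = {
    "Q4_0":    {"MMLU": 329, "ARC easy": 245, "ARC challenge": 38, "Winogrande": 79},
    "Q4_1":    {"MMLU": 368, "ARC easy": 274, "ARC challenge": 42, "Winogrande": 87},
    "i1-Q4_0": {"MMLU": 331, "ARC easy": 270, "ARC challenge": 39, "Winogrande": 80},
    "i1-Q4_1": {"MMLU": 360, "ARC easy": 301, "ARC challenge": 42, "Winogrande": 87},
}


def slowdown_percent(base: dict, other: dict) -> dict:
    """Percentage by which `other` takes longer than `base`, per benchmark."""
    return {name: 100.0 * (other[name] - base[name]) / base[name] for name in base}


for base, other in (("Q4_0", "Q4_1"), ("i1-Q4_0", "i1-Q4_1")):
    for bench, pct in slowdown_percent(TIMES[base], TIMES[other]).items():
        print(f"{other} vs {base} on {bench}: {pct:+.1f}%")
```

On these numbers Q4_1 comes out roughly 8 to 12 percent slower than the matching Q4_0 on every benchmark.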

Future research

I want to come up with some performance tests that are skewed mainly towards token generation instead of prompt evaluation, and I want to run them in several configurations: CPU only, CPU with computation offloaded to the GPU (-ngl 0), manually offloading layers to the GPU, GPU with memory overflowing into RAM (GGML_CUDA_ENABLE_UNIFIED_MEMORY=1), GPU, and using RPC.

Results of all quants

MMLU performance

Filename	Time (seconds)
Fook-Yi-34B-32K-v1.i1-IQ1_S	181
Fook-Yi-34B-32K-v1.i1-IQ2_XXS	202
Fook-Yi-34B-32K-v1.i1-IQ1_M	218
Fook-Yi-34B-32K-v1.i1-IQ2_XS	225
Fook-Yi-34B-32K-v1.i1-IQ2_S	233
Fook-Yi-34B-32K-v1.i1-IQ2_M	243
Fook-Yi-34B-32K-v1.i1-IQ3_XXS	256
Fook-Yi-34B-32K-v1.i1-Q2_K_S	261
Fook-Yi-34B-32K-v1.i1-Q2_K	264
Fook-Yi-34B-32K-v1.i1-IQ3_XS	268
Fook-Yi-34B-32K-v1.Q2_K	271
Fook-Yi-34B-32K-v1.IQ3_XS	275
Fook-Yi-34B-32K-v1.i1-IQ3_S	279
Fook-Yi-34B-32K-v1.i1-Q3_K_S	282
Fook-Yi-34B-32K-v1.IQ3_S	286
Fook-Yi-34B-32K-v1.i1-IQ3_M	286
Fook-Yi-34B-32K-v1.IQ3_M	288
Fook-Yi-34B-32K-v1.Q3_K_S	295
Fook-Yi-34B-32K-v1.i1-Q3_K_M	302
Fook-Yi-34B-32K-v1.Q3_K_M	315
Fook-Yi-34B-32K-v1.i1-IQ4_XS	319
Fook-Yi-34B-32K-v1.i1-Q3_K_L	321
Fook-Yi-34B-32K-v1.IQ4_XS	323
Fook-Yi-34B-32K-v1.Q3_K_L	325
Fook-Yi-34B-32K-v1.Q4_0	329
Fook-Yi-34B-32K-v1.i1-Q4_0	331
Fook-Yi-34B-32K-v1.i1-Q4_K_S	334
Fook-Yi-34B-32K-v1.i1-IQ4_NL	335
Fook-Yi-34B-32K-v1.Q4_K_S	338
Fook-Yi-34B-32K-v1.IQ4_NL	339
Fook-Yi-34B-32K-v1.i1-Q4_K_M	347
Fook-Yi-34B-32K-v1.Q4_K_M	348
Fook-Yi-34B-32K-v1.i1-Q4_1	360
Fook-Yi-34B-32K-v1.Q4_1	368
Fook-Yi-34B-32K-v1.Q5_0	387
Fook-Yi-34B-32K-v1.i1-Q5_0	388
Fook-Yi-34B-32K-v1.i1-Q5_K_S	388
Fook-Yi-34B-32K-v1.Q5_K_S	394
Fook-Yi-34B-32K-v1.i1-Q5_K_M	397
Fook-Yi-34B-32K-v1.Q5_K_M	398
Fook-Yi-34B-32K-v1.Q5_1	416
Fook-Yi-34B-32K-v1.i1-Q5_1	416
Fook-Yi-34B-32K-v1.i1-Q6_K	455
Fook-Yi-34B-32K-v1.Q6_K	456
Fook-Yi-34B-32K-v1.Q8_0	547
Fook-Yi-34B-32K-v1.SOURCE	975

ARC easy performance

Filename	Time (seconds)
Fook-Yi-34B-32K-v1.i1-IQ1_S	106
Fook-Yi-34B-32K-v1.i1-IQ1_M	117
Fook-Yi-34B-32K-v1.i1-IQ2_XXS	132
Fook-Yi-34B-32K-v1.i1-IQ2_XS	139
Fook-Yi-34B-32K-v1.i1-IQ2_S	148
Fook-Yi-34B-32K-v1.i1-IQ2_M	159
Fook-Yi-34B-32K-v1.Q2_K	165
Fook-Yi-34B-32K-v1.i1-Q2_K_S	170
Fook-Yi-34B-32K-v1.i1-Q2_K	172
Fook-Yi-34B-32K-v1.i1-IQ3_XXS	176
Fook-Yi-34B-32K-v1.IQ3_XS	185
Fook-Yi-34B-32K-v1.i1-IQ3_XS	189
Fook-Yi-34B-32K-v1.IQ3_S	190
Fook-Yi-34B-32K-v1.Q3_K_S	190
Fook-Yi-34B-32K-v1.i1-IQ3_S	198
Fook-Yi-34B-32K-v1.i1-Q3_K_S	203
Fook-Yi-34B-32K-v1.IQ3_M	206
Fook-Yi-34B-32K-v1.Q3_K_M	206
Fook-Yi-34B-32K-v1.i1-IQ3_M	207
Fook-Yi-34B-32K-v1.i1-Q3_K_M	225
Fook-Yi-34B-32K-v1.IQ4_XS	229
Fook-Yi-34B-32K-v1.Q3_K_L	230
Fook-Yi-34B-32K-v1.IQ4_NL	242
Fook-Yi-34B-32K-v1.i1-Q3_K_L	242
Fook-Yi-34B-32K-v1.Q4_K_S	243
Fook-Yi-34B-32K-v1.i1-IQ4_XS	243
Fook-Yi-34B-32K-v1.Q4_0	245
Fook-Yi-34B-32K-v1.i1-IQ4_NL	256
Fook-Yi-34B-32K-v1.SOURCE	257
Fook-Yi-34B-32K-v1.i1-Q4_K_S	262
Fook-Yi-34B-32K-v1.i1-Q4_0	270
Fook-Yi-34B-32K-v1.Q4_1	274
Fook-Yi-34B-32K-v1.Q4_K_M	275
Fook-Yi-34B-32K-v1.i1-Q4_K_M	276
Fook-Yi-34B-32K-v1.Q5_0	287
Fook-Yi-34B-32K-v1.Q5_K_S	295
Fook-Yi-34B-32K-v1.Q5_K_M	301
Fook-Yi-34B-32K-v1.i1-Q4_1	301
Fook-Yi-34B-32K-v1.Q5_1	320
Fook-Yi-34B-32K-v1.i1-Q5_K_S	323
Fook-Yi-34B-32K-v1.i1-Q5_0	332
Fook-Yi-34B-32K-v1.Q6_K	347
Fook-Yi-34B-32K-v1.i1-Q5_K_M	352
Fook-Yi-34B-32K-v1.i1-Q5_1	356
Fook-Yi-34B-32K-v1.i1-Q6_K	386
Fook-Yi-34B-32K-v1.Q8_0	458

ARC challenge performance

Filename	Time (seconds)
Fook-Yi-34B-32K-v1.i1-IQ1_S	23
Fook-Yi-34B-32K-v1.i1-IQ2_XXS	25
Fook-Yi-34B-32K-v1.i1-IQ1_M	26
Fook-Yi-34B-32K-v1.i1-IQ2_S	27
Fook-Yi-34B-32K-v1.i1-IQ2_XS	27
Fook-Yi-34B-32K-v1.i1-IQ2_M	30
Fook-Yi-34B-32K-v1.i1-IQ3_XS	31
Fook-Yi-34B-32K-v1.i1-IQ3_XXS	31
Fook-Yi-34B-32K-v1.Q2_K	32
Fook-Yi-34B-32K-v1.i1-Q2_K	32
Fook-Yi-34B-32K-v1.i1-Q2_K_S	32
Fook-Yi-34B-32K-v1.IQ3_XS	33
Fook-Yi-34B-32K-v1.Q3_K_S	33
Fook-Yi-34B-32K-v1.i1-IQ3_M	33
Fook-Yi-34B-32K-v1.i1-IQ3_S	33
Fook-Yi-34B-32K-v1.i1-Q3_K_S	33
Fook-Yi-34B-32K-v1.IQ3_M	34
Fook-Yi-34B-32K-v1.IQ3_S	34
Fook-Yi-34B-32K-v1.i1-Q3_K_M	35
Fook-Yi-34B-32K-v1.Q3_K_M	36
Fook-Yi-34B-32K-v1.IQ4_XS	37
Fook-Yi-34B-32K-v1.Q3_K_L	38
Fook-Yi-34B-32K-v1.Q4_0	38
Fook-Yi-34B-32K-v1.i1-IQ4_XS	38
Fook-Yi-34B-32K-v1.i1-Q3_K_L	38
Fook-Yi-34B-32K-v1.i1-IQ4_NL	39
Fook-Yi-34B-32K-v1.i1-Q4_0	39
Fook-Yi-34B-32K-v1.i1-Q4_K_S	39
Fook-Yi-34B-32K-v1.IQ4_NL	40
Fook-Yi-34B-32K-v1.Q4_K_S	40
Fook-Yi-34B-32K-v1.Q4_K_M	41
Fook-Yi-34B-32K-v1.i1-Q4_K_M	41
Fook-Yi-34B-32K-v1.Q4_1	42
Fook-Yi-34B-32K-v1.i1-Q4_1	42
Fook-Yi-34B-32K-v1.i1-Q5_0	44
Fook-Yi-34B-32K-v1.Q5_0	45
Fook-Yi-34B-32K-v1.Q5_K_M	45
Fook-Yi-34B-32K-v1.Q5_K_S	45
Fook-Yi-34B-32K-v1.i1-Q5_K_S	45
Fook-Yi-34B-32K-v1.i1-Q5_K_M	46
Fook-Yi-34B-32K-v1.Q5_1	48
Fook-Yi-34B-32K-v1.i1-Q5_1	48
Fook-Yi-34B-32K-v1.i1-Q6_K	52
Fook-Yi-34B-32K-v1.Q6_K	53
Fook-Yi-34B-32K-v1.Q8_0	63
Fook-Yi-34B-32K-v1.SOURCE	110

Winogrande performance

Filename	Time (seconds)
Fook-Yi-34B-32K-v1.i1-IQ1_S	45
Fook-Yi-34B-32K-v1.i1-IQ2_XXS	50
Fook-Yi-34B-32K-v1.i1-IQ1_M	54
Fook-Yi-34B-32K-v1.i1-IQ2_XS	56
Fook-Yi-34B-32K-v1.i1-IQ2_S	57
Fook-Yi-34B-32K-v1.i1-IQ2_M	60
Fook-Yi-34B-32K-v1.i1-IQ3_XXS	63
Fook-Yi-34B-32K-v1.i1-Q2_K_S	64
Fook-Yi-34B-32K-v1.i1-IQ3_XS	65
Fook-Yi-34B-32K-v1.i1-Q2_K	65
Fook-Yi-34B-32K-v1.Q2_K	66
Fook-Yi-34B-32K-v1.i1-IQ3_S	68
Fook-Yi-34B-32K-v1.i1-Q3_K_S	68
Fook-Yi-34B-32K-v1.IQ3_S	69
Fook-Yi-34B-32K-v1.i1-IQ3_M	69
Fook-Yi-34B-32K-v1.IQ3_M	70
Fook-Yi-34B-32K-v1.IQ3_XS	70
Fook-Yi-34B-32K-v1.Q3_K_S	70
Fook-Yi-34B-32K-v1.i1-Q3_K_M	73
Fook-Yi-34B-32K-v1.Q3_K_M	75
Fook-Yi-34B-32K-v1.i1-IQ4_XS	77
Fook-Yi-34B-32K-v1.IQ4_XS	78
Fook-Yi-34B-32K-v1.Q3_K_L	78
Fook-Yi-34B-32K-v1.i1-Q3_K_L	78
Fook-Yi-34B-32K-v1.Q4_0	79
Fook-Yi-34B-32K-v1.i1-Q4_0	80
Fook-Yi-34B-32K-v1.i1-Q4_K_S	80
Fook-Yi-34B-32K-v1.IQ4_NL	81
Fook-Yi-34B-32K-v1.i1-IQ4_NL	81
Fook-Yi-34B-32K-v1.Q4_K_S	82
Fook-Yi-34B-32K-v1.Q4_K_M	83
Fook-Yi-34B-32K-v1.i1-Q4_K_M	84
Fook-Yi-34B-32K-v1.Q4_1	87
Fook-Yi-34B-32K-v1.i1-Q4_1	87
Fook-Yi-34B-32K-v1.Q5_0	93
Fook-Yi-34B-32K-v1.i1-Q5_0	93
Fook-Yi-34B-32K-v1.i1-Q5_K_S	93
Fook-Yi-34B-32K-v1.Q5_K_M	95
Fook-Yi-34B-32K-v1.Q5_K_S	95
Fook-Yi-34B-32K-v1.i1-Q5_K_M	95
Fook-Yi-34B-32K-v1.Q5_1	100
Fook-Yi-34B-32K-v1.i1-Q5_1	100
Fook-Yi-34B-32K-v1.i1-Q6_K	109
Fook-Yi-34B-32K-v1.Q6_K	111
Fook-Yi-34B-32K-v1.Q8_0	130
Fook-Yi-34B-32K-v1.SOURCE	227

mradermacher

Owner Sep 21

•

edited Sep 21

Ah, the claim was that Q4_1 uses less energy than Q4_K_M ("for the quality"), which involves a lot of variables. It is, however, also not backed up by your benchmarks (assuming longer time means more energy usage), unless "for the quality" somehow figures in favour of Q4_1, which, again, seems not to be the case.

@yttria, it seems Q4_1 is bigger, slower and worse than Q4_K_M, and not that much better than Q4_0, which is even faster.

PS: I didn't try to make you make these benchmarks, but I'll take them :-) However, they do seem a bit fishy: they follow the quant size more or less exactly, indicating a memory bottleneck, so cpu speed doesn't even figure in. I would assume yttria did it on a cpu with a lot fewer cores, where things can be dramatically different. Which is why I provide Q4_0 in the first place, as a fast quant for cpus. I hope one of the results of all this benchmarking (quality and speed) is to tell us once and for all if Q4_0 actually is useful (it might well be that another quant gives a better quality/time ratio).
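A quick way to check the "timings follow the quant size" observation is to correlate the MMLU times with bits per weight. This is only a sketch: the bits-per-weight values below are not from this thread but are derived from the ggml block layouts of the legacy quants (for example, a Q4_0 block stores 32 weights in 18 bytes, which is 4.5 bits per weight), and they ignore tensors that are kept at higher precision.

```python
import math

# Nominal bits per weight from the ggml block layouts (assumption, see above).
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q4_1": 5.0, "Q5_0": 5.5, "Q5_1": 6.0, "Q8_0": 8.5}

# MMLU times in seconds for the static (non-i1) quants, from the MMLU table.
MMLU_SECONDS = {"Q4_0": 329, "Q4_1": 368, "Q5_0": 387, "Q5_1": 416, "Q8_0": 547}


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


quants = sorted(BITS_PER_WEIGHT)
r = pearson([BITS_PER_WEIGHT[q] for q in quants], [MMLU_SECONDS[q] for q in quants])
print(f"correlation between bits per weight and MMLU time: {r:.3f}")
```

A correlation this close to 1 is consistent with a memory bottleneck, but on its own it does not prove one.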

nicoboss

Sep 22

•

edited Sep 22

I didn't try to make you make these benchmarks, but I'll take them :-) However, they do seem a bit fishy: they follow the quant size more or less exactly, indicating a memory bottleneck, so cpu speed doesn't even figure in. I would assume yttria did it on a cpu with a lot fewer cores, where things can be dramatically different. Which is why I provide Q4_0 in the first place, as a fast quant for cpus. I hope one of the results of all this benchmarking (quality and speed) is to tell us once and for all if Q4_0 actually is useful (it might well be that another quant gives a better quality/time ratio).

Great point. I'm aware that the shared performance test results are not perfect, which is not surprising as I never intended those tests to be used to measure quant performance. I did all tests on a Threadripper PRO 7975WX (32 cores, 64 threads) while also using the GPU to do computations. This resulted in super fast prompt evaluation during prompt-evaluation-heavy tests. So those tests are indeed almost certainly memory bottlenecked, which might not always be the case during normal use, depending on the hardware. I'm wondering if there are really any realistic cases where LLMs are not memory bottlenecked. After just a few threads I start seeing diminishing returns when adding more. Typical consumer hardware has just dual-channel memory and so should be bottlenecked even sooner. I definitely want to do more careful testing regarding this and maybe even test on more mainstream hardware like my laptop. I actually already did many GGUF performance tests one year ago during the planning stage of my StormPeak build. I did so by changing the number of memory channels, the memory clock speed and the number of threads assigned to llama.cpp. The conclusion back then was pretty much that the more memory channels and the faster the memory, the better the performance of GGUF files executed on the CPU will be. This was the main reason I decided to go for the much more expensive Threadripper PRO with octa-channel memory instead of the normal Threadripper lineup with quad-channel memory for my StormPeak node.

mradermacher

Owner Sep 22

•

edited Sep 22

I agree with your methodology, these side effects were of course not the goal. There are lots of realistic cases where the cpu is the bottleneck, though. In fact, most cpus will be the bottleneck when confronted with IQ quants, and memory bottlenecked with normal Q quants. Which is totally in line with your experience (no IQ quants last year). There are even lots of systems where Q4_K_M might be a bottleneck, which is why I added Q4_0 quants back - purely for speed.

Whenever I get to my performance benchmarks, my plan is to make a 4-core baseline test for specifically slower cpus. There are a lot of people who run smaller models on not very beefy laptops, for example.

yttria

Sep 22

•

edited Sep 22

This is my test on M3 Max processing and generating a fixed number of tokens:

Prompt processing

Quant	Time / s	Energy / J
F16	5.1	254
Q8_0	5.3	270
Q6_K	6.2	320
Q5_1	5.8	290
Q5_K	6.3	352
Q5_0	5.8	290
Q5_K_S	6.3	322
Q4_1	5.3	257
Q4_K	5.7	290
IQ4_NL	5.5	276
Q4_K_S	5.6	285
Q4_0	5.3	259
IQ4_XS	5.5	283

Token generation

Quant	Time / s	Energy / J
F16	21.2	476
Q8_0	12.2	323
Q6_K	10.0	412
Q5_1	10.3	416
Q5_K	10.4	477
Q5_0	9.6	391
Q5_K_S	10.4	489
Q4_1	8.2	241
Q4_K	8.3	312
IQ4_NL	8.4	345
Q4_K_S	8.1	304
Q4_0	7.6	227
IQ4_XS	7.7	307
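For comparing rows of the token generation table, a small sketch that derives average power (energy divided by time) and the relative energy difference between two quants. The function names are made up for illustration, and treating the Q4_K row as Q4_K_M is an assumption; the table does not say so.

```python
# (time in seconds, energy in joules) from the token generation table.
TOKEN_GEN = {
    "F16": (21.2, 476), "Q8_0": (12.2, 323), "Q6_K": (10.0, 412),
    "Q5_1": (10.3, 416), "Q5_K": (10.4, 477), "Q5_0": (9.6, 391),
    "Q5_K_S": (10.4, 489), "Q4_1": (8.2, 241), "Q4_K": (8.3, 312),
    "IQ4_NL": (8.4, 345), "Q4_K_S": (8.1, 304), "Q4_0": (7.6, 227),
    "IQ4_XS": (7.7, 307),
}


def average_watts(quant: str) -> float:
    """Average power over the run: joules divided by seconds."""
    seconds, joules = TOKEN_GEN[quant]
    return joules / seconds


def energy_saving_percent(quant: str, reference: str) -> float:
    """How much less energy `quant` used than `reference`, in percent."""
    return 100.0 * (TOKEN_GEN[reference][1] - TOKEN_GEN[quant][1]) / TOKEN_GEN[reference][1]


print(f"Q4_1 vs Q4_K: {energy_saving_percent('Q4_1', 'Q4_K'):.1f}% less energy")
print(f"Q4_0 vs Q4_1: {energy_saving_percent('Q4_0', 'Q4_1'):.1f}% less energy")
for quant in ("Q4_0", "Q4_1", "Q4_K"):
    print(f"{quant}: {average_watts(quant):.1f} W average")
```

On these numbers Q4_1 used about 23% less energy than Q4_K for token generation, and Q4_0 used about 6% less than Q4_1.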

mradermacher

Owner Sep 22

well, the Q4_0 seems to be more energy-efficient than Q4_1, and is probably higher quality per bit, too

yttria

Sep 22

Another thing, I see you are providing f16 ggufs for some bf16 models. Wouldn't it be better to convert directly to bf16 gguf to eliminate conversion loss?

mradermacher

Owner Sep 22

•

edited Sep 22

since f16 has higher precision and weights should be mostly normalised, there shouldn't be any unless the model already has issues (and these issues would translate to other quants as well). the purpose of the f16 quant is not to provide a faithful representation of the source (which would be provided as SOURCE gguf) but to provide an actual f16 quant, i.e. pretty much the same purpose as the Q4_0, to provide a quant for certain targets.
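The precision argument can be illustrated with plain Python. bf16 keeps 8 significant bits with a wide exponent range, while f16 keeps 11 significant bits with a narrow one, so a bf16 value converts to f16 without loss as long as its magnitude falls inside f16's normal range. The sketch below emulates bf16 by truncating a float32 to its top 16 bits (real converters typically round instead), and the helper names are made up for illustration.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def survives_f16(x: float) -> bool:
    """True if x is exactly representable as an IEEE half-precision float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0] == x
    except OverflowError:
        # Magnitude is beyond the largest finite f16 value (65504).
        return False


typical = to_bf16(0.1234)  # a weight of ordinary magnitude
tiny = to_bf16(1e-6)       # falls into f16's subnormal range
huge = to_bf16(1e5)        # exceeds f16's maximum

for name, value in (("typical", typical), ("tiny", tiny), ("huge", huge)):
    print(f"{name}: {value!r} exact in f16: {survives_f16(value)}")
```

Only the very small and very large values lose anything, which matches the point that well-normalised weights should convert cleanly.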

nicoboss

Sep 22

f16 is for sure more useful than bf16 if you consider that you need at least an Nvidia Ampere based GPU for bf16 support, while f16 has been supported since Tegra X1 (Maxwell+) and so works on Pascal and Turing as well, not just Ampere and Ada. This is also why back at university I used a Nintendo Switch console running Linux for scientific computation, as my GTX 980 Ti Maxwell GPU did not support half precision. Ironically, on the CPU side it is exactly the opposite picture: while bf16 is already widely supported, f16 is not.

There are some rare models where you can find f16, bf16 and SOURCE quants like this one: https://huggingface.co/mradermacher/Fook-Yi-34B-32K-v1-GGUF. SOURCE quants are usually only provided if obtaining the original model is not easily possible. For example, if a model got deleted by the author after mradermacher already downloaded it and he notices it before deleting them.

This is my test on M3 Max processing and generating a fixed number of tokens

Thanks a lot for sharing your measurements. What application did you use to create them? Probably a macOS thing as there seems no simple way for me to measure power consumption used for generating tokens unless I do everything on the GPU.

mradermacher

Owner Sep 22

•

edited Sep 22

@nicoboss your cpu should have a power meter, try /sys/class/powercap/intel-rapl/*/energy_uj - might even have one per core. don't know how good the amd implementation is but the intel estimate is usually pretty good, and amd usually does better.
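A minimal sketch of how those counters could be used to measure the energy of a workload: read every `energy_uj` before and after, and sum the differences. The function names are hypothetical. It assumes the counters wrap around at the value in the neighbouring `max_energy_range_uj` file, and on many systems reading these files requires root.

```python
import glob
import os

RAPL_GLOB = "/sys/class/powercap/intel-rapl/*/energy_uj"


def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Counter difference in microjoules, allowing for one wraparound."""
    if after >= before:
        return after - before
    return after + max_range - before


def measure_joules(workload, pattern: str = RAPL_GLOB) -> float:
    """Run `workload()` and return the energy used across all matched zones."""
    zones = glob.glob(pattern)
    before = {zone: read_int(zone) for zone in zones}
    workload()
    total_uj = 0
    for zone in zones:
        max_range = read_int(os.path.join(os.path.dirname(zone), "max_energy_range_uj"))
        total_uj += energy_delta_uj(before[zone], read_int(zone), max_range)
    return total_uj / 1e6


# Example (not executed here):
#   import subprocess
#   joules = measure_joules(lambda: subprocess.run(["sleep", "5"], check=True))
```

This counts everything the package did during the run, not just the workload, so an idle baseline measured the same way is worth subtracting.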

yttria

Sep 22

•

edited Sep 22

What application did you use to create them?

Energy is measured with the built-in powermetrics utility on macOS.

nicoboss

Sep 22

your cpu should have a power meter, try /sys/class/powercap/intel-rapl/*/energy_uj

That is really cool. I wasn't aware of it. AMD's intel-rapl implementation is decent. There is also amd_energy, which is so accurate it got kicked out of the Linux kernel due to security concerns regarding side-channel attacks, but I can just build and load it as a kernel module.
