What 2-bit quantization approach are you using?
Hi, I was wondering what the exact quantization approach is.
Are you using the approach described in this Feb 6, 2024 paper out of Austria? https://arxiv.org/pdf/2401.06118.pdf
I'm using my own. It is described, admittedly without much detail, in these llama.cpp PRs: https://github.com/ggerganov/llama.cpp/pull/4773, https://github.com/ggerganov/llama.cpp/pull/4856, https://github.com/ggerganov/llama.cpp/pull/4897. The last one requires the importance matrix, added with https://github.com/ggerganov/llama.cpp/pull/4861.
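For intuition about where an importance matrix can enter the picture, here is a rough sketch only, not the actual code in those PRs: per-weight activation statistics can act as weights in the squared-error objective, so the block scale is fitted by weighted least squares. The 2-bit round-to-nearest grid, block size, and function names below are illustrative assumptions, not the IQ2_XXS/IQ2_XS construction.

```python
import numpy as np

def quantize_block_weighted(w, imp, qmin=-2, qmax=1, n_scales=32):
    """Toy sketch: importance-weighted scale search for one block of weights.

    `imp` plays the role of one row of an importance matrix (per-weight
    activation statistics): weights that matter more for the activations
    contribute more to the error being minimized. The simple symmetric
    2-bit grid [qmin, qmax] is an illustrative assumption.
    """
    best = None
    max_abs = np.abs(w).max() + 1e-12
    for k in range(1, n_scales + 1):
        # candidate scale, then round-to-nearest onto the integer grid
        d = max_abs * k / (n_scales * max(abs(qmin), abs(qmax)))
        q = np.clip(np.round(w / d), qmin, qmax)
        # closed-form optimal scale for this rounding under weighted MSE
        denom = (imp * q * q).sum()
        if denom > 0:
            d = (imp * w * q).sum() / denom
        err = (imp * (w - d * q) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, d, q)
    return best[1], best[2].astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)
imp = rng.uniform(0.5, 2.0, size=32).astype(np.float32)  # stand-in importances
scale, q = quantize_block_weighted(w, imp)
print("weighted rmse:", float(np.sqrt((imp * (w - scale * q) ** 2).mean())))
```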
I'm aware of the AQLM paper, and they do get a slightly lower quantization error at exactly 2 bits-per-weight (the IQ2_XXS quants in llama.cpp). But already at 2.25 bits-per-weight (the IQ2_XS quants in llama.cpp), my quantization is significantly better than theirs. They get the better quantization through a fairly long training run (many GPU hours on a high-end GPU) that updates what they call a "codebook", along with some of the model parameters (output normalization and such). I have decided against going down that route because the main focus of llama.cpp, where I'm contributing, is "Inference at the Edge", so users must be able to quantize their own models in a reasonable amount of time (and they can now do that; the only reason these repositories exist is that there was a period when inference with these quants was possible but the quantization methods weren't yet added to llama.cpp).
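As a side note on the bits-per-weight figures above: fractional values come from amortizing per-block metadata (scales and the like) over the weights in a block. A minimal accounting sketch, with block size and scale precision chosen purely for illustration rather than matching the actual IQ2 layouts:

```python
def bits_per_weight(index_bits, block_size, scale_bits):
    """Storage cost of a blockwise quantization: per-weight index bits
    plus the per-block scale amortized over the block."""
    return index_bits + scale_bits / block_size

# e.g. 2-bit indices with a 16-bit scale shared by 32 weights -> 2.5 bpw;
# larger blocks or shared/implicit scales push this closer to 2.0 bpw.
print(bits_per_weight(2, 32, 16))   # 2.5
print(bits_per_weight(2, 256, 16))  # 2.0625
```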