What 2-bit quantization approach are you using?
Hi, I was wondering what the exact quantization approach is.
Are you using the approach described in this Feb 6, 2024 paper out of Austria? https://arxiv.org/pdf/2401.06118.pdf
I'm using my own. It is described, admittedly without much detail, in these llama.cpp PRs: https://github.com/ggerganov/llama.cpp/pull/4773, https://github.com/ggerganov/llama.cpp/pull/4856, https://github.com/ggerganov/llama.cpp/pull/4897. The last one requires the importance matrix, added with https://github.com/ggerganov/llama.cpp/pull/4861.
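For intuition about where an importance matrix can enter the picture, here is a rough sketch only, not the actual code in those PRs: per-weight activation statistics can act as weights in the squared-error objective, so the block scale is fitted by weighted least squares. The 2-bit round-to-nearest grid, block size, and function names below are illustrative assumptions, not the IQ2_XXS/IQ2_XS construction.

```python
import numpy as np

def quantize_block_weighted(w, imp, qmin=-2, qmax=1, n_scales=32):
    """Toy sketch: importance-weighted scale search for one block of weights.

    `imp` plays the role of one row of an importance matrix (per-weight
    activation statistics): weights that matter more for the activations
    contribute more to the error being minimized. The simple symmetric
    2-bit grid [qmin, qmax] is an illustrative assumption.
    """
    best = None
    max_abs = np.abs(w).max() + 1e-12
    for k in range(1, n_scales + 1):
        # candidate scale, then round-to-nearest onto the integer grid
        d = max_abs * k / (n_scales * max(abs(qmin), abs(qmax)))
        q = np.clip(np.round(w / d), qmin, qmax)
        # closed-form optimal scale for this rounding under weighted MSE
        denom = (imp * q * q).sum()
        if denom > 0:
            d = (imp * w * q).sum() / denom
        err = (imp * (w - d * q) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, d, q)
    return best[1], best[2].astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)
imp = rng.uniform(0.5, 2.0, size=32).astype(np.float32)  # stand-in importances
scale, q = quantize_block_weighted(w, imp)
print("weighted rmse:", float(np.sqrt((imp * (w - scale * q) ** 2).mean())))
```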
I'm aware of the AQLM paper, and they do get a slightly lower quantization error at exactly 2 bits-per-weight (the IQ2_XXS quants in llama.cpp). But already at 2.25 bits-per-weight (the IQ2_XS quants in llama.cpp), my quantization is significantly better than theirs. They get the better quantization through a fairly long training run (many GPU hours on a high-end GPU) that updates what they call a "codebook", along with some of the model parameters (output normalization and such). I have decided against going down that route because the main focus of llama.cpp, where I'm contributing, is "Inference at the Edge", so users must be able to quantize their own models in a reasonable amount of time (and they can now do that; the only reason these repositories exist is that there was a period when inference with these quants was possible but the quantization methods weren't yet added to llama.cpp).
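As a side note on the bits-per-weight figures above: fractional values come from amortizing per-block metadata (scales and the like) over the weights in a block. A minimal accounting sketch, with block size and scale precision chosen purely for illustration rather than matching the actual IQ2 layouts:

```python
def bits_per_weight(index_bits, block_size, scale_bits):
    """Storage cost of a blockwise quantization: per-weight index bits
    plus the per-block scale amortized over the block."""
    return index_bits + scale_bits / block_size

# e.g. 2-bit indices with a 16-bit scale shared by 32 weights -> 2.5 bpw;
# larger blocks or shared/implicit scales push this closer to 2.0 bpw.
print(bits_per_weight(2, 32, 16))   # 2.5
print(bits_per_weight(2, 256, 16))  # 2.0625
```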