8-bit quantization is not working with this model on the latest oobabooga
gptq-8bit-32g raised the following exception:
site-packages\exllama\cuda_ext.py", line 33, in ext_make_q4
return make_q4(qweight,
RuntimeError: qweight and qzeros have incompatible shapes
I'm using the "ExLlama_HF" loader; AutoGPTQ is no better.
I'm not clear on what it's trying to do with "ext_make_q4".
gptq-4bit-32g works without issue.
oobabooga is running in its own virtual env and matches the latest requirements.
ExLlama doesn't support 8-bit GPTQ, and AutoGPTQ doesn't currently support Mistral GPTQ.
Please try using Transformers as the loader and see if that works. I've not personally tested it in Oobabooga, but I know Transformers works from Python code.
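Roughly, the Python-side load looks like the sketch below. The repo name and revision are placeholders (substitute whichever GPTQ repo/branch you actually downloaded), and Transformers needs optimum and auto-gptq installed to handle GPTQ weights:

```python
# Minimal sketch: loading a GPTQ branch directly with Transformers.
# Repo id and revision are assumptions; adjust them to the model you have.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-GPTQ"  # placeholder repo name
revision = "gptq-8bit-32g"                  # the 8-bit branch discussed above

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",  # place the quantized weights on the GPU automatically
)

# Quick generation test
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```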
It works with Transformers, but it's slow (3 to 4 tokens/s). The GGUF release with 6-bit quantization on llama.cpp runs at 8 tokens/s, and the 4-bit GPTQ with ExLlama_HF reaches 45 to 48 tokens/s.
I'm using a 4090, and the last case uses 11 GB on the GPU.
I cannot see a quality difference between the models, at least for storytelling, but I did not test for long.
Thank you for your work and your reply.