8-bit quantization is not working with this model on the latest oobabooga
gptq-8bit-32g raised the following exception:
site-packages\exllama\cuda_ext.py", line 33, in ext_make_q4
return make_q4(qweight,
RuntimeError: qweight and qzeros have incompatible shapes
I'm using the "ExLlama_HF" loader; AutoGPTQ is no better.
I'm not clear on what it's trying to do with "ext_make_q4".
gptq-4bit-32g works without issue.
oobabooga is running in its own virtual env and matches the latest requirements.
ExLlama doesn't support 8-bit GPTQ, and AutoGPTQ doesn't currently support Mistral GPTQ.
Please try using Transformers as the loader and see if that works. I've not personally tested it in Oobabooga, but I know Transformers works from Python code.
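Roughly, the Python-side load looks like the sketch below. The repo name and revision are placeholders (substitute whichever GPTQ repo/branch you actually downloaded), and Transformers needs optimum and auto-gptq installed to handle GPTQ weights:

```python
# Minimal sketch: loading a GPTQ branch directly with Transformers.
# Repo id and revision are assumptions; adjust them to the model you have.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-GPTQ"  # placeholder repo name
revision = "gptq-8bit-32g"                  # the 8-bit branch discussed above

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",  # place the quantized weights on the GPU automatically
)

# Quick generation test
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```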
It works with Transformers, but it's slow (3 to 4 tokens/s). The GGUF release with 6-bit quantization on llama.cpp runs at 8 tokens/s, and the 4-bit GPTQ with ExLlama_HF reaches 45 to 48 tokens/s.
I'm using a 4090, and the last case uses 11 GB on the GPU.
I cannot see a quality difference between the models, at least for storytelling, but I did not test for long.
Thank you for your work and your reply.