Inference time and memory usage of quantized versions

#18
opened by DILLIP-KUMAR

I've tested Qwen2-VL-2B-Instruct and its quantized versions (e.g., GPTQ-Int8, GPTQ-Int4) on several GPUs, including the V100, RTX 4090, RTX 3060, and T4 (Colab). Although the quantized models use less GPU memory than the original 16-bit version, I observed an unexpected increase in inference time:

- Int8 inference is slower than 16-bit.
- 4-bit inference time falls between the 16-bit and 8-bit times.

I would appreciate insight into why quantized models show this increase in inference time, and suggestions on how to optimize inference speed.
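For reference, here is a minimal timing sketch of the kind of comparison described above. It assumes the Hugging Face transformers `generate` API, a CUDA GPU, and uses the text-only Qwen2.5-Coder checkpoints (also mentioned below) to keep the example simple; the checkpoint names are illustrative, so substitute whichever fp16 and quantized variants you are actually benchmarking.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def time_generation(model_id, prompt="Explain GPTQ in one sentence.", n_runs=5):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="cuda"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up run so kernel compilation/caching is not counted.
    model.generate(**inputs, max_new_tokens=64, do_sample=False)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_runs):
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / n_runs

    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{model_id}: {elapsed:.2f} s/run, ~{new_tokens / elapsed:.1f} tok/s")

# Illustrative checkpoint names; replace with the variants you are comparing.
for mid in ("Qwen/Qwen2.5-Coder-7B-Instruct",
            "Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int8"):
    time_generation(mid)
```

Reporting tokens per second alongside peak memory (e.g., via `torch.cuda.max_memory_allocated()`) makes the fp16-vs-quantized trade-off easier to compare across GPUs.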

Additional testing: This behavior was also consistent across other models like Qwen2.5-Coder with different quantization methods (e.g., GPTQ, AWQ, and custom linear quantization).
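To illustrate what a "custom linear quantization" of the naive weight-only kind typically involves (an assumption about what is meant here), the sketch below quantizes a linear layer's weights to int8 and dequantizes them on the fly in the forward pass. That extra dequantize step before every matmul, combined with the absence of a fused low-bit GEMM kernel, is one reason such schemes can run slower than plain fp16 even though they store smaller weights.

```python
import torch
import torch.nn as nn

class NaiveInt8Linear(nn.Module):
    """Weight-only absmax int8 quantization with on-the-fly dequantization."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # per-output-channel scale
        self.register_buffer("scale", scale)
        self.register_buffer("w_int8", torch.round(w / scale).to(torch.int8))
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize before the matmul: extra memory traffic and compute on
        # every call, which a plain fp16 nn.Linear does not pay.
        w = self.w_int8.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)

# Quick check: outputs stay close, but each forward pass does extra work.
layer = nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
qlayer = NaiveInt8Linear(layer)
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
print(torch.allclose(layer(x), qlayer(x), atol=0.5))
```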
