inference time and memory usage of quantized version
#18
by DILLIP-KUMAR
I've tested Qwen2-VL-2B-Instruct and its quantized variants (GPTQ-Int8, 4-bit, etc.) on several GPUs, including the V100, RTX 4090, RTX 3060, and T4 (Colab). Although the quantized models use less GPU memory than the original 16-bit version, I observed an unexpected increase in inference time:
- Int8 inference is slower than 16-bit.
- 4-bit inference time falls between the 16-bit and 8-bit times.
I would appreciate insights into why this increase in inference time occurs with quantized models and suggestions on how to optimize inference speed.
Additional testing: This behavior was also consistent across other models such as Qwen2.5-Coder with different quantization methods (e.g., GPTQ, AWQ, and custom linear quantization).
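For context, the comparison was roughly of the form sketched below (shown for the text-only Qwen2.5-Coder case; the model IDs, prompt, and generation settings here are illustrative placeholders, not my exact configuration):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints -- swap in the exact variants being compared.
MODEL_IDS = {
    "bf16": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
    "gptq-int8": "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int8",
    "awq-int4": "Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ",
}

PROMPT = "Write a Python function that reverses a linked list."


def benchmark(model_id: str, max_new_tokens: int = 256) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)

    # Warm-up pass so one-time kernel/cache setup is excluded from the timing.
    model.generate(**inputs, max_new_tokens=16)

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    peak_mem_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(
        f"{model_id}: {new_tokens / elapsed:.1f} tok/s, "
        f"peak memory {peak_mem_gib:.2f} GiB"
    )


for name, model_id in MODEL_IDS.items():
    benchmark(model_id)
```

The warm-up generation and the `torch.cuda.synchronize()` calls are there so that one-time setup costs and asynchronous CUDA execution don't skew the per-model timings; peak allocated memory is reset before each timed run.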