Can anyone use vLLM (or another engine that supports dynamic batching) to run this with more than 1 GPU?
#1 opened by bash99
I can run this with the example Python code.
But vLLM always complains "ValueError: The input size is not aligned with the quantized weight shape.", and there seems to be no solution on the vLLM side:
https://github.com/vllm-project/vllm/issues/5675
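For reference, this is roughly how I try to launch it on more than one GPU; the model path is a placeholder for the quantized checkpoint, so treat it as a sketch rather than an exact repro:

```python
# Rough sketch of the multi-GPU launch that hits the ValueError.
# "your-org/your-quantized-model" is a placeholder for the actual checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-quantized-model",  # placeholder path
    quantization="gptq",                    # or "awq", depending on the checkpoint
    tensor_parallel_size=2,                 # more than 1 GPU
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```

From the linked issue, the misalignment seems to come from sharding the quantized weights across GPUs when the layer sizes don't divide evenly by the group size times the tensor-parallel degree.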
Or could I convert it to GPTQ with a group size of 64? I'm not sure whether vLLM supports that.
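To be concrete, what I have in mind for the group-size-64 conversion is something like the AutoGPTQ sketch below; the paths and calibration text are placeholders, and I haven't verified that vLLM can load the result:

```python
# Hypothetical sketch: re-quantize the original (unquantized) model to GPTQ
# with group_size=64. Paths and calibration data are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

src = "path/to/original-model"   # placeholder: unquantized source checkpoint
dst = "./model-gptq-int4-g64"    # placeholder: output directory

tokenizer = AutoTokenizer.from_pretrained(src)
# A real run would use a few hundred calibration samples, not one toy sentence.
examples = [tokenizer("Calibration text goes here.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=64,   # instead of the usual 128
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(src, quantize_config)
model.quantize(examples)
model.save_quantized(dst)
tokenizer.save_pretrained(dst)
```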
I've found https://qwen.readthedocs.io/en/latest/quantization/gptq.html; its Troubleshooting section says you can pad the original model and then quantize it, but that requires a very large amount of memory. Why not just release one that works?
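If I'm reading that Troubleshooting section right, the padding step looks roughly like the sketch below (paths and the padded size are placeholders, and it assumes a Llama/Qwen-style layout with model.model.layers[i].mlp.{gate,up,down}_proj). It has to load the whole unquantized model in bf16 first, which is exactly the memory problem:

```python
# Hedged sketch of the "pad then quantize" workaround, as I understand it.
# Paths and pad_to are placeholders; pad_to should be divisible by
# group_size * tensor_parallel_size. Zero padding should keep the outputs of
# a standard gated MLP unchanged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "path/to/original-model"   # placeholder: unquantized source checkpoint
dst = "./model-padded"           # placeholder: where to save the padded model
pad_to = 29696                   # placeholder target intermediate_size

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(src)

def pad_linear(linear, out_features, in_features):
    # Copy the old weight into a larger zero-initialized Linear.
    new = torch.nn.Linear(in_features, out_features, bias=False,
                          dtype=linear.weight.dtype)
    new.weight.data.zero_()
    new.weight.data[:linear.out_features, :linear.in_features] = linear.weight.data
    return new

for layer in model.model.layers:
    mlp = layer.mlp
    # gate_proj / up_proj grow on the output side, down_proj on the input side.
    mlp.gate_proj = pad_linear(mlp.gate_proj, pad_to, mlp.gate_proj.in_features)
    mlp.up_proj = pad_linear(mlp.up_proj, pad_to, mlp.up_proj.in_features)
    mlp.down_proj = pad_linear(mlp.down_proj, mlp.down_proj.out_features, pad_to)

model.config.intermediate_size = pad_to
model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```

The padded checkpoint would then be quantized as usual, so the whole pipeline needs enough RAM/VRAM to hold the full-precision weights.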