Can anyone use vLLM (or another engine that supports dynamic batching) to run this with more than 1 GPU?
#1 opened by bash99
I can run this with the example Python code.
But vLLM always complains "ValueError: The input size is not aligned with the quantized weight shape.", and there seems to be no solution on the vLLM side:
https://github.com/vllm-project/vllm/issues/5675
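For reference, this is roughly how I try to launch it on more than one GPU; the model path is a placeholder for the quantized checkpoint, so treat it as a sketch rather than an exact repro:

```python
# Rough sketch of the multi-GPU launch that hits the ValueError.
# "your-org/your-quantized-model" is a placeholder for the actual checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-quantized-model",  # placeholder path
    quantization="gptq",                    # or "awq", depending on the checkpoint
    tensor_parallel_size=2,                 # more than 1 GPU
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```

From the linked issue, the misalignment seems to come from sharding the quantized weights across GPUs when the layer sizes don't divide evenly by the group size times the tensor-parallel degree.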
Or could I convert it to GPTQ with a group size of 64? I'm not sure whether vLLM supports that.
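To be concrete, what I have in mind for the group-size-64 conversion is something like the AutoGPTQ sketch below; the paths and calibration text are placeholders, and I haven't verified that vLLM can load the result:

```python
# Hypothetical sketch: re-quantize the original (unquantized) model to GPTQ
# with group_size=64. Paths and calibration data are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

src = "path/to/original-model"   # placeholder: unquantized source checkpoint
dst = "./model-gptq-int4-g64"    # placeholder: output directory

tokenizer = AutoTokenizer.from_pretrained(src)
# A real run would use a few hundred calibration samples, not one toy sentence.
examples = [tokenizer("Calibration text goes here.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=64,   # instead of the usual 128
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(src, quantize_config)
model.quantize(examples)
model.save_quantized(dst)
tokenizer.save_pretrained(dst)
```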
I've found https://qwen.readthedocs.io/en/latest/quantization/gptq.html; its Troubleshooting section says you can pad the original model and then quantize it, but that requires a very large amount of memory. Why not just release one that works?
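If I'm reading that Troubleshooting section right, the padding step looks roughly like the sketch below (paths and the padded size are placeholders, and it assumes a Llama/Qwen-style layout with model.model.layers[i].mlp.{gate,up,down}_proj). It has to load the whole unquantized model in bf16 first, which is exactly the memory problem:

```python
# Hedged sketch of the "pad then quantize" workaround, as I understand it.
# Paths and pad_to are placeholders; pad_to should be divisible by
# group_size * tensor_parallel_size. Zero padding should keep the outputs of
# a standard gated MLP unchanged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "path/to/original-model"   # placeholder: unquantized source checkpoint
dst = "./model-padded"           # placeholder: where to save the padded model
pad_to = 29696                   # placeholder target intermediate_size

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(src)

def pad_linear(linear, out_features, in_features):
    # Copy the old weight into a larger zero-initialized Linear.
    new = torch.nn.Linear(in_features, out_features, bias=False,
                          dtype=linear.weight.dtype)
    new.weight.data.zero_()
    new.weight.data[:linear.out_features, :linear.in_features] = linear.weight.data
    return new

for layer in model.model.layers:
    mlp = layer.mlp
    # gate_proj / up_proj grow on the output side, down_proj on the input side.
    mlp.gate_proj = pad_linear(mlp.gate_proj, pad_to, mlp.gate_proj.in_features)
    mlp.up_proj = pad_linear(mlp.up_proj, pad_to, mlp.up_proj.in_features)
    mlp.down_proj = pad_linear(mlp.down_proj, mlp.down_proj.out_features, pad_to)

model.config.intermediate_size = pad_to
model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```

The padded checkpoint would then be quantized as usual, so the whole pipeline needs enough RAM/VRAM to hold the full-precision weights.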