Are these models limited to H100s?

by RonanMcGovern

I've run this fine on H100s, but on A100s or A6000s I get:

[rank0]: RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Is it possible to get this working on the A100 or A6000, or am I just limited here?

Neural Magic org

Unfortunately this model is limited to GPUs that support the FP8 data format, including the Hopper architecture but excluding the Ampere architecture.
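For anyone checking their own hardware, here is a minimal sketch (standard PyTorch APIs only) of how to confirm whether a GPU meets the compute-capability floor from the error above; A100 (Ampere) reports 8.0, which is why it falls outside the supported range:

```python
# Sketch: check whether this GPU meets the compute-capability floor from the
# error message (>= 8.9, i.e. Ada/Hopper). A100 (Ampere) reports 8.0.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
if (major, minor) >= (8, 9):
    print("Hardware FP8 (torch._scaled_mm) should be available.")
else:
    print("No hardware FP8 support; a weight-only fallback would be needed.")
```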

Makes a lot of sense, thanks for the nice work. Interestingly, it seems that with weight-only FP8 I'm able to get pretty much the same results on A100s as with full FP8 on H100 (Hopper).
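For context, the kind of weight-only FP8 setup described here can be approximated in vLLM by quantizing an unquantized checkpoint at load time. This is only a sketch, and the model id is a placeholder rather than the checkpoint from this repo:

```python
# Sketch: on-the-fly FP8 weight-only quantization of an unquantized checkpoint
# in vLLM on an A100. The model id is a placeholder, not this repo's checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder, gated on HF
    quantization="fp8",  # weights quantized to FP8 at load time
)
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```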

RonanMcGovern changed discussion status to closed
Neural Magic org

@RonanMcGovern this model should still run in vLLM on A100; it will just choose the FP8 weight-only pathway. Are you using the latest vLLM release?

Ok, interesting. I'm using the latest docker image (maybe updates haven't been pushed to that yet?). The error I'm getting is:

[rank0]: RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
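For reference, the equivalent offline invocation looks roughly like the sketch below (the model id is a placeholder for the FP8 checkpoint being discussed); on an A100 with this vLLM build, loading the pre-quantized checkpoint is the path that hits the torch._scaled_mm error above:

```python
# Sketch: loading a pre-quantized FP8 checkpoint with vLLM's offline API.
# The model id is a placeholder for the FP8 checkpoint being discussed.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
params = SamplingParams(max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```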
Neural Magic org

My apologies, I got confused by the various formats available. Currently this is blocked on https://github.com/vllm-project/vllm/pull/6524. Thanks for reporting; we will work on landing it ASAP.

OK, that would be great. I'll move my models over to the neuralmagic ones once that works, since the FP8 download is faster and also doesn't require an HF_TOKEN.

BTW @mgoin, the FP8 model is almost as fast as NVIDIA NIM on an H100 SXM, which is impressive, at least at batch size 1: 130 tok/s vs 120 tok/s on a short prompt with 500 tokens generated.

At larger batch sizes the speeds diverge, until at a batch of 64 NIM is still managing about 120 tok/s while vLLM with the Neural Magic FP8 model is doing about 35 tok/s.

I wonder what's behind that difference. Presumably the gap can be closed, given how close things are at batch size 1.
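In case it helps with reproducing, here is a rough sketch of how the per-request throughput numbers above could be measured on the vLLM side with the offline API (the model id, prompt, and batch sizes are assumptions; NIM would need to be benchmarked separately against its own server):

```python
# Rough sketch: measure decode throughput at batch sizes 1 and 64,
# generating ~500 tokens per request. Model id and prompt are placeholders.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
params = SamplingParams(temperature=0.0, max_tokens=500, ignore_eos=True)

for batch_size in (1, 64):
    prompts = ["Write a short story about a robot."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(
        f"batch={batch_size}: {generated / elapsed:.0f} tok/s aggregate, "
        f"{generated / elapsed / batch_size:.0f} tok/s per request"
    )
```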
