run this model on a 2xRTX4090 machine with vLLM

#2 by choucavalier - opened

i'm trying to run this model on a 2xRTX4090 machine using vLLM for serving

it seems that my system is not able to run it (each GPU has 24GB of VRAM)

is this expected?

thanks

I have two 3090s, which is similar to your case. I use the DPO version, but it should behave about the same. I pass the following parameters to vLLM:

--gpu-memory-utilization 0.8
--quantization awq
--tensor-parallel-size 2
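
Putting those flags together, the full serve command looks roughly like this. This is only a sketch: `<model-repo-id>` is a placeholder for the AWQ checkpoint you're actually serving (not named in this thread), and `--max-model-len 4096` is just an assumption to leave headroom for the KV cache on 24 GB cards.

```bash
# Sketch only: <model-repo-id> is a placeholder for the AWQ-quantized repo being served.
# Shards the model across both GPUs and caps vLLM at 80% of each card's memory.
python -m vllm.entrypoints.openai.api_server \
  --model <model-repo-id> \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 4096   # assumed context cap, not from the thread
```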

thanks man, this worked! i was using the same args but with --gpu-memory-utilization 0.98 (I thought I was giving more memory to the model)

choucavalier changed discussion status to closed
