run this model on a 2xRTX4090 machine with vLLM

#2 by choucavalier - opened

i'm trying to run this model on a 2xRTX4090 machine using vLLM for serving

it seems that my system is not able to run it (each GPU has 24GB of VRAM)

is this expected?

thanks

I have two 3090s, which is similar to your case. I use the DPO version, but it should behave about the same. I pass the following parameters to vLLM:

--gpu-memory-utilization 0.8
--quantization awq
--tensor-parallel-size 2
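
Putting those flags together, the full serve command looks roughly like this. This is only a sketch: `<model-repo-id>` is a placeholder for the AWQ checkpoint you're actually serving (not named in this thread), and `--max-model-len 4096` is just an assumption to leave headroom for the KV cache on 24 GB cards.

```bash
# Sketch only: <model-repo-id> is a placeholder for the AWQ-quantized repo being served.
# Shards the model across both GPUs and caps vLLM at 80% of each card's memory.
python -m vllm.entrypoints.openai.api_server \
  --model <model-repo-id> \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 4096   # assumed context cap, not from the thread
```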

thanks man, this worked! i was using the same args but with --gpu-memory-utilization 0.98 (I thought I was giving more memory to the model)

choucavalier changed discussion status to closed
