How do I run Meta-Llama-3-70B-Instruct-FP8 on multiple devices?
Please provide an exact script for running the Meta-Llama-3-70B-Instruct-FP8 model across several devices.
It works well when I use only one device:
```python
from vllm import LLM

# model_path is my local path to Meta-Llama-3-70B-Instruct-FP8
model = LLM(model=model_path, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```
But I get CUDA errors when I run this:
```python
from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=2, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```
Set `tensor_parallel_size=NUM_GPUS` when launching.
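For example, here is a minimal sketch (assuming `torch` is importable, which any vLLM install satisfies, and that `model_path` is defined as in the question) that derives NUM_GPUS automatically:

```python
import torch
from vllm import LLM

# torch.cuda.device_count() counts only the GPUs visible to this process,
# so tensor_parallel_size matches what vLLM can actually use.
num_gpus = torch.cuda.device_count()

model = LLM(
    model=model_path,  # assumed: local path to Meta-Llama-3-70B-Instruct-FP8
    tensor_parallel_size=num_gpus,
    quantization="fp8",
    max_model_len=100,
)
result = model.generate("Hello, my name is")
```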
That works. I have 4xH100, but I had been using only 2 of them, and that caused the error.
So, the working code is:
```python
import os

# Make only two GPUs visible; set this before vLLM initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=2, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```
or
```python
from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=4, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```
I don't quite follow.
tensor_parallel_size must equal the number of GPUs visible to the process. In my case 4 devices were visible, so I should either have set tensor_parallel_size = 4, or limited the visible devices with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" and set tensor_parallel_size = 2.
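A minimal sketch that turns this rule into an executable check (assuming `model_path` is defined as above; the GPU pair "0,1" is just an example):

```python
import os

# Restrict visibility before importing torch/vllm, so nothing
# initializes CUDA with all four devices first.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
from vllm import LLM

tp_size = 2
# Fail fast if tensor_parallel_size and the visible GPU count disagree.
assert tp_size == torch.cuda.device_count(), \
    "tensor_parallel_size must equal the number of visible GPUs"

model = LLM(model=model_path, tensor_parallel_size=tp_size, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```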