No multi-GPU inference support?
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
Output generated in 2.42 seconds (0.00 tokens/s, 0 tokens, context 65, seed 459973075)
It seems to me that there is a total lack of multi-GPU support for inference.
I would appreciate it if this were addressed.
Best wishes, and thank you so much for your hard work!
Hi @dataautogpt3, can you share a reproducible snippet together with the full traceback of the error? Thanks!
I'm getting the same issue with the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # BatchEncoding has no .cuda(); use .to()
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
results in:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1244, in forward
aux_loss = load_balancing_loss_func(
File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 98, in load_balancing_loss_func
gate_logits = torch.cat(gate_logits, dim=0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
That call is for computing the auxiliary load-balancing loss. I think the code you are running is not the latest, because I already pushed a fix, but I will check.
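For context, load_balancing_loss_func receives one router-logits tensor per layer, and with device_map="auto" those layers live on different GPUs, so torch.cat fails. A minimal sketch of the device-alignment pattern that resolves this (illustrative only, not necessarily the exact upstream patch):

import torch

def concat_router_logits(gate_logits):
    # gate_logits: tuple of per-layer router logits, possibly spread across GPUs
    compute_device = gate_logits[0].device
    # move every layer's logits onto one device before concatenating
    return torch.cat([layer_gate.to(compute_device) for layer_gate in gate_logits], dim=0)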
@bjoernp can you try:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", revision="refs/pr/5")
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # BatchEncoding has no .cuda(); use .to()
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Works! Thanks :)
Hi @bjoernp,
Should the code above parallelize the model across multiple GPUs? Is it device_map="auto" that does this?
Thanks.
Hi @bweinstein123, yes, device_map="auto" should split the model evenly across all GPUs.
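If you want to see how the model was split, the device map that accelerate built is stored on the model, and you can optionally cap per-GPU memory. A small sketch (the memory limits below are just example values):

from transformers import AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",                   # load in the checkpoint's native dtype
    max_memory={0: "40GiB", 1: "40GiB"},  # optional per-GPU caps (example values)
)
print(model.hf_device_map)                # shows which modules landed on which GPU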