Issue with multi-GPU inference.
Just tested the model, and it looks good. But it seems you have inherited an issue from the base Falcon: when running inference over multiple GPUs I get gibberish unless I pass use_cache=False
to the model.generate function. Not sure why this happens.
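For reference, the workaround is just passing the flag through generate, roughly like this (a minimal sketch; the model id and prompt are only examples, and the sharded fine-tuned checkpoint would go in their place):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model id; swap in the fine-tuned checkpoint actually being tested.
model_id = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shard the weights across all visible GPUs
    trust_remote_code=True,     # Falcon originally shipped custom modelling code
)

inputs = tokenizer("Write a short poem about GPUs.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    use_cache=False,  # workaround: disabling the KV cache avoids the multi-GPU gibberish
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```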
Same issue with the Open Assistant RLHF LLaMA model on multi-GPU. I can't test it right now, but if your flag fixes it, I think it's a bitsandbytes issue, because for me it only gave the errors with load_in_8bit=True. use_cache=False does not fix it for that one :(
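For context, the 8-bit load in question looks roughly like this (a sketch; the model id is a placeholder and load_in_8bit requires bitsandbytes to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: substitute the actual Open Assistant RLHF LLaMA checkpoint discussed above.
model_id = "open-assistant-rlhf-llama-model-id"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # requires bitsandbytes; this is the setting that triggered the errors for me
    device_map="auto",   # spread the 8-bit weights across the available GPUs
)
```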
I didn't use quantisation for Falcon. I just loaded it across 4 V100s.
Eastwind, how were you able to harness all GPUs for a single prompt? I tried with an 8xA100 80GB machine and it only used 1 GPU, crashing the model for lack of memory. Can you share your config and code? Pretty please? Thanks.
I did it with device_map="auto". It works with the cache disabled. But for multi-GPU, the Falcon authors replied to my original post and said to use the Hugging Face text-generation-inference hosting solution. I haven't tested it out, however.
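Concretely, the load looked roughly like this (a sketch, assuming an A100/V100-class multi-GPU machine; the model id and the per-GPU cap are illustrative, and max_memory is just one way to make accelerate spread the shards over every GPU instead of filling a single card):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",        # example id; use the actual checkpoint you are serving
    torch_dtype=torch.bfloat16,
    device_map="auto",          # let accelerate shard the layers over every visible GPU
    max_memory={i: "75GiB" for i in range(torch.cuda.device_count())},  # optional per-GPU cap
    trust_remote_code=True,
)
```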
Thanks. Unfortunately that hosting solution is no longer available. Do you know what its configuration was?
See this issue that I also made, lol. I haven't tested it, but given that they have it working, I would assume it does work: https://github.com/huggingface/text-generation-inference/issues/417
Thanks :-)
Eastwind, could you please tell us how fast the inference was? Even for small prompts on the V100s it is taking me a good minute to render the response, and for longer prompts it is crashing with a CUDA OOM error.
I've noticed Falcon 40B works fine in 16-bit (bfloat16) mode. When running it in 8-bit, it runs like garbage and is CPU-bound. The performance is HORRIBLE in anything but 16/32-bit, and it's always CPU-bound. Running in 16/32-bit, it uses my cards sequentially, running each up to 90% utilization before popping to the next card. I've got 3x 48GB A6000 cards. When loaded in 16-bit, it takes up about 30GB on each card, and during inference this can climb as high as 45GB per card. This model is resource hungry and does not operate well in 8-bit or 4-bit quantized mode at all.
That is definitely in line with the performance I have seen on a 4xV100S cluster (128GB combined VRAM). Using 8-bit or 4-bit changes the model size but does not speed up inference (0.75 tokens per second).