3 x 4090 - won't load
Just pulled the latest exllamav2, and the 5-bit quant won't load.
At the very end of loading it suddenly starts allocating more VRAM on gpu1.
I've tried gpu_split all the way down to {16,19,23}. The initial load and allocations look OK, then it blows up gpu1's VRAM.
Plenty of room left on gpu3...
Tried setting the cache max_seq_len down to 8192; same behaviour.
I'll try a smaller quant and post what happens here.
Same behavior with 3-bit. Guess I'll re-install exllamav2...
8k context is still a lot for this model. You'd need 20 GB of VRAM just for the cache at 8k (no GQA), plus 47 GB for the weights at 5 bpw, and the large vocabulary means the implementation has to reserve 2.4 GB of temp buffer on the last device to accommodate the output layer. So with activations and Torch/CUDA overhead, 3x24 GB isn't going to cut it unless you drop the bitrate or the context some more.
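To put rough numbers on it, here's a back-of-envelope budget. The weights/cache/output-buffer figures are the ones above; the per-GPU overhead is just a ballpark assumption, not a measured value:

```python
# Rough VRAM budget for the 5 bpw quant at 8192 context, in GB.
weights_5bpw     = 47.0  # model weights at 5 bits per weight
kv_cache_8k      = 20.0  # FP16 K/V cache at 8192 tokens, no GQA
output_buffer    = 2.4   # temp buffer on the last device for the large output layer
overhead_per_gpu = 1.0   # assumed activations + Torch/CUDA context per GPU (ballpark)

num_gpus     = 3
vram_per_gpu = 24.0

required  = weights_5bpw + kv_cache_8k + output_buffer + overhead_per_gpu * num_gpus
available = num_gpus * vram_per_gpu

print(f"need ~{required:.1f} GB vs {available:.1f} GB available")
# need ~72.4 GB vs 72.0 GB available, so it simply doesn't fit,
# even before fragmentation and uneven splitting are taken into account.
```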
Update for those on the same path:
With context size 8192 you can load the 4-bit quant with a gpu_split of {13, 15, 23} (minimal load sketch below).
turboderp says context is really expensive, so presumably you can also drop the context further if you can live with it.
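For reference, a minimal load sketch with those settings using the plain exllamav2 Python API; the model path is a placeholder, so adjust for whatever frontend you're actually running:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-4.0bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 8192  # set after prepare() so it overrides the model's default

model = ExLlamaV2(config)
model.load(gpu_split = [13, 15, 23])  # GB per GPU, same numbers as above

cache = ExLlamaV2Cache(model)  # cache is sized from config.max_seq_len
```

The cache allocation follows config.max_seq_len, so lowering that value (not just the context you send from the client) should be what actually shrinks the per-GPU cache footprint.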