Set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
Sorry for the simple question, but how did you change the environment variable in TabbyAPI? I edited the end of my start.sh so it ends with
export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
python start.py "$@"
and it seems to break the autosplit functionality set in config.yaml.
Hey, no worries.
I ran into that same issue a while back. It's actually a bug introduced in exllamav2 0.2.3. Here's the issue tracking it: https://github.com/turboderp/exllamav2/issues/647
It's already been fixed on the dev branch and should be in the next release. In the meantime, I rolled back to a TabbyAPI commit (56ce82e) that still used exllamav2 0.2.2, and that worked fine for me.
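In case it helps, here's a rough sketch of the rollback (the directory and venv names are from my setup, so adjust them to wherever your TabbyAPI checkout lives):
cd tabbyAPI                      # your TabbyAPI checkout
git checkout 56ce82e             # a commit still pinned to exllamav2 0.2.2
source venv/bin/activate         # only if you manage TabbyAPI's venv yourself
pip install exllamav2==0.2.2     # pin the matching wheel; the old requirements should pull this in anyway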
It works as intended now - appreciate it!
Would it be possible to make a request? I have heard very good things about a recently released Llama 3.1 model (https://huggingface.co/MikeRoz/ArliAI_Llama-3.1-70B-ArliAI-RPMax-v1.2-4.5bpw-h6-exl2), but I'm having a little trouble optimizing the exact size for 48 GB of VRAM. The 4.5 bpw quant runs at 65K context, but 32K is my intended use case, and at 5 bpw it's slightly too large to fit. Is that a setup issue? What is your experience with Llama 3.1 quants for 48 GB at 32K?
Glad to hear it's working for you now!
I'm not taking requests at the moment, but I can share my experience. Llama 3.1 70B models can just about handle 5 bpw with a 32K context on 48 GB of VRAM, but that assumes the VRAM is completely empty (mine is, since my monitor is plugged into the APU) and that you set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync (without it, TabbyAPI gives me OOM errors).
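For a rough sense of why it's so tight, some back-of-the-envelope numbers (approximate; exact figures depend on your cache mode and overhead):
Weights: ~70B parameters × 5 bits ÷ 8 ≈ 44 GB.
KV cache: 2 (K+V) × 80 layers × 8 KV heads × 128 head dim × 2 bytes ≈ 320 KB per token at FP16, so a 32K context is ≈ 10 GB at FP16, or roughly 2.5 GB with a Q4 cache.
With a quantized cache that's ~46-47 GB before activation and allocator overhead, which is right at the edge of 48 GB; with an FP16 cache it doesn't fit at all. It's also why anything else touching the card tips it into OOM.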
I downloaded the 5bpw version of the model you mentioned (https://huggingface.co/MikeRoz/ArliAI_Llama-3.1-70B-ArliAI-RPMax-v1.2-5.0bpw-h6-exl2) and it loaded fine for me at 32K.
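For reference, here's a minimal config.yaml sketch for that kind of load (key names are from TabbyAPI's sample config, so double-check them against your own file; model_name is just the folder the quant was saved under, and Q4 is one way to keep the KV cache small):
model:
  model_name: ArliAI_Llama-3.1-70B-ArliAI-RPMax-v1.2-5.0bpw-h6-exl2
  max_seq_len: 32768        # 32K context
  gpu_split_auto: true      # let the loader split the model across your GPUs
  cache_mode: Q4            # quantized KV cache; FP16 won't fit at this size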
I’ve taken some screenshots to show how it fits:
TLDR: It's most likely a setup or config issue.
Thanks for the troubleshooting and your continued work in this rather particular niche. Looking forward to future releases!