Any way to make it work on an RTX 3070 (8GB VRAM)?

#1
by tarruda - opened

Hi, thanks for sharing this model.

I'm following the instructions here to load and interact with the model, but the 4-bit version does not fit in my GPU.

Using the same instructions to load your model (v2) works up to the first "generate" request. The second time I ask a question, it fails with an out of memory error.

Do you know if there are any settings I can tweak to make it fit in my GPU, or is 8GB of VRAM just too low for this model? This is the command I used to run it:

python server.py --model berker_vicuna-13B-1.1-GPTQ-3bit-128g-v2 --auto-devices --wbits 3 --groupsize 128 --chat
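
I also noticed that server.py has a --pre_layer flag that is supposed to allocate only the first N layers of a GPTQ model to the GPU and run the rest on the CPU. Would something along these lines be the right way to use it on an 8GB card? The layer count below is just a guess, and I assume generation gets slower because of the CPU offload:

python server.py --model berker_vicuna-13B-1.1-GPTQ-3bit-128g-v2 --wbits 3 --groupsize 128 --chat --pre_layer 30

Lowering the maximum context length / max new tokens in the chat settings might also help, since I suspect the out-of-memory error on the second question comes from the growing context filling whatever VRAM is left.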

Hello, I have just tried to load this model using KoboldAI with Occam's 4-bit fork. For some reason, the model does not load into VRAM, which is unfortunate, since the tokens per second will be very slow. Have you gotten this model to load into VRAM and work correctly since then?
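
If anyone wants to check this on their own setup, watching GPU memory while the model loads makes it obvious, for example:

watch -n 1 nvidia-smi

If the memory-usage column barely moves during loading, the weights are staying in system RAM, which would explain the very slow tokens per second.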
