Any way to make it work on an RTX 3070 (8GB VRAM)?

#1
by tarruda - opened

Hi, thanks for sharing this model.

I'm following the instructions here to load and interact with the model, but the 4-bit version does not fit in my GPU.

Using the same instructions to load your model (v2) works up to the first "generate" request. The second time I ask a question, it fails with an out of memory error.

Do you know if there are any settings I can tweak to make it fit in my GPU, or is 8GB of VRAM just too low for this model? This is the command I used to run it:

python server.py --model berker_vicuna-13B-1.1-GPTQ-3bit-128g-v2 --auto-devices --wbits 3 --groupsize 128 --chat
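
I also noticed that server.py has a --pre_layer flag that is supposed to allocate only the first N layers of a GPTQ model to the GPU and run the rest on the CPU. Would something along these lines be the right way to use it on an 8GB card? The layer count below is just a guess, and I assume generation gets slower because of the CPU offload:

python server.py --model berker_vicuna-13B-1.1-GPTQ-3bit-128g-v2 --wbits 3 --groupsize 128 --chat --pre_layer 30

Lowering the maximum context length / max new tokens in the chat settings might also help, since I suspect the out-of-memory error on the second question comes from the growing context filling whatever VRAM is left.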

Hello, I have just tried to load this model using KoboldAI with Occam's 4-bit fork. For some reason, the model does not load into VRAM, which is unfortunate, since the tokens per second will be very slow. Have you gotten this model to load into VRAM and work correctly since then?
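
If anyone wants to check this on their own setup, watching GPU memory while the model loads makes it obvious, for example:

watch -n 1 nvidia-smi

If the memory-usage column barely moves during loading, the weights are staying in system RAM, which would explain the very slow tokens per second.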
