Use these flags to run this model, and any GPTQ model split between GPU and CPU, if you don't have enough VRAM

#6 opened by rombodawg

Here are the flags to run with the GPU and CPU simultaneously. You can change --pre_layer to however many layers you want to load onto your GPU, and --max_seq_len to the maximum context length in tokens. Make sure it's either 2048, 4096, 6144, or 8192, and make sure --compress_pos_emb matches it at 1, 2, 3, or 4 respectively.

python server.py --chat --model TheBloke_WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --pre_layer 8 --loader exllama_HF --compress_pos_emb 4 --max_seq_len 8192 --trust-remote-code
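
For example, if you only want a 4096-token context, the values pair up like this (this is just an illustrative variant of the command above; --pre_layer 8 is what works for my 10GB card, so tune it to your own VRAM):

python server.py --chat --model TheBloke_WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --pre_layer 8 --loader exllama_HF --compress_pos_emb 2 --max_seq_len 4096 --trust-remote-code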

Cool, thanks for posting.

What is performance like when CPU offloading?

With this particular model I get 4 tokens a second, with 10 GB filled in VRAM and 20 GB filled in system RAM. However, I am running XMP profile 1 on my 64 GB of RAM, overclocked to 3600 MHz, and my EVGA XC3 Ultra 3080 is overclocked with a memory OC of +400 and a GPU OC of +100, which boosts performance a bit. Full specs below:

EVGA RTX 3080 XC3 Ultra, 10 GB VRAM
Ryzen 5 5600X CPU
64 GB 3200 MHz RAM (OC'd to 3600 MHz)
MSI B550 PRO-VDH WiFi motherboard
750 W power supply

I should note that 4 tk/s is the speed it starts at; it slows down to 2 tk/s after it's been running for a little while.

OK, so not unusable. But pretty slow, and I would expect GGML to be faster.

Have you tried KoboldCpp instead, with CUDA acceleration and partial GPU offloading? It means you can't use text-generation-webui, but KoboldCpp is quite a good GUI as well.
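
Roughly something like this (the GGML filename here is just a placeholder for whichever quant you download, and you'd set --gpulayers to however many layers fit in your 10GB of VRAM):

python koboldcpp.py --usecublas --gpulayers 25 --contextsize 8192 WizardLM-33B-V1.0-Uncensored-SuperHOT-8K.ggmlv3.q4_K_M.bin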

Yeah, I have used KoboldCpp, but I can't get over how these models sometimes require the "Start reply with" feature that oobabooga has in order to do anything right. They just plain don't follow instructions correctly a lot of the time unless you force them to. So that tends to be why I avoid KoboldCpp: not because I don't like it, but because the models suck and need a lot of extra guidance.
