Use these flags to run this model, or any GPTQ model split between GPU and CPU, if you don't have enough VRAM.
Here are the flags to run on the GPU and CPU simultaneously. You can change pre_layer to however many layers you want to load onto your GPU, and max_seq_len to the maximum context length (in tokens) you want the model to handle. Make sure it's either 2048, 4096, 6144, or 8192, and make sure compress_pos_emb matches it at 1, 2, 3, or 4 respectively.
python server.py --chat --model TheBloke_WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --pre_layer 8 --loader exllama_HF --compress_pos_emb 4 --max_seq_len 8192 --trust-remote-code
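For example, if you only want a 4096-token context, you'd pair max_seq_len 4096 with compress_pos_emb 2 and could afford to push more layers onto the GPU. This is just a sketch using the same flags as above; the pre_layer value of 20 is an illustrative guess, so raise or lower it to fit your VRAM:
python server.py --chat --model TheBloke_WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --pre_layer 20 --loader exllama_HF --compress_pos_emb 2 --max_seq_len 4096 --trust-remote-code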
Cool, thanks for posting.
What is performance like when CPU offloading?
With this particular model I get 4 tokens a second with 10GB filled in VRAM and 20GB filled in system RAM. However, I am running XMP profile 1 on my 64GB of RAM, OC'd to 3600MHz, and my EVGA XC3 Ultra 3080 is overclocked with a memory OC of +400 and a GPU OC of +100. That does boost the performance a bit. Full specs below:
EVGA RTX 3080 XC3 Ultra, 10GB VRAM
Ryzen 5600X CPU
64GB 3200MHz RAM (OC'd to 3600MHz)
MSI B550 Pro-VDH WiFi motherboard
750W power supply
I should note that 4 tk/s is the speed it starts at; it slows down to 2 tk/s after it's been running for a little while.
OK, so not unusable. But pretty slow, and I would expect GGML to be faster.
Have you tried KoboldCpp instead, with CUDA acceleration and partial GPU offloading? It means you can't use text-generation-webui, but KoboldCpp is quite a good GUI as well.
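Something along these lines, for reference; the model path is a placeholder for whichever GGML quant you grab, and the --gpulayers value is just an example to tune to your VRAM:
python koboldcpp.py path/to/your-ggml-model.bin --usecublas --gpulayers 30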
Yeah, I have used KoboldCpp, but I can't get over how these models sometimes require the "Start reply with" feature that oobabooga has to do anything right. They just plain don't follow instructions correctly a lot of the time unless you force them to. So that tends to be mostly why I avoid KoboldCpp: not because I don't like it, but because the models suck and require a lot of extra guidance.