Help regarding the best quantization for the PC specification below.

by bikkikumarsha

Since I don't want to spend hours downloading something that doesn't run, I am wondering which quantization would fit nicely on my system.
CPU: Intel Core i7-13700K
RAM: 64 GB
GPU: RTX 3090 (24 GB dedicated memory)

Also, is there any general rule of thumb that we should be following?

Hi,
I am about to upload the IQ1 models, which are the smallest ones. Those should fit without any issue.
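As a rough rule of thumb (re the question above): a GGUF quant takes roughly (parameter count × bits per weight) / 8 bytes in memory, plus some headroom for the KV cache and runtime buffers. A minimal sketch, assuming a 70B-parameter model and approximate bits-per-weight figures (both are illustrative, not this model's actual numbers):

```python
def approx_gguf_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Very rough in-memory size of a quantized model in GB."""
    return n_params_billion * bits_per_weight / 8

# Approximate average bits-per-weight for common llama.cpp quant types
# (illustrative values; the exact figure varies per model).
QUANT_BPW = {"IQ1_M": 1.75, "IQ2_XS": 2.3, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q8_0": 8.5}

vram_gb = 24  # RTX 3090
for name, bpw in QUANT_BPW.items():
    size = approx_gguf_gb(70, bpw)  # assuming a 70B-parameter model
    verdict = "fits in VRAM" if size <= vram_gb - 2 else "needs CPU offload"
    print(f"{name}: ~{size:.0f} GB -> {verdict}")
```

Anything that doesn't fit can still run with partial offload to system RAM, just much slower, as the replies below show.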

Thanks for all the good work.

Hello,

I have roughly the same setup as you (RTX 3090, 64 GB RAM, Intel Core i9), and for my testing I used text-generation-webui with the IQ1_M model, loaded with 36 layers on the GPU.

Also, I had to limit the context size to 4096 and it seems that the max_new_tokens value has an impact on the quality of the results. I get better results with a max_new_tokens of 512 than with 1024.
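For anyone reproducing this outside text-generation-webui, here is a minimal sketch of the same settings with llama-cpp-python (the GGUF file name is a placeholder):

```python
from llama_cpp import Llama

# Sketch of the setup described above; the model path is a placeholder.
llm = Llama(
    model_path="./model-IQ1_M.gguf",
    n_gpu_layers=36,  # 36 layers offloaded to the RTX 3090
    n_ctx=4096,       # context size limited to 4096
)

out = llm(
    "Summarize the GGUF format in two sentences.",
    max_tokens=512,  # keeping max_new_tokens at 512 seemed to give better quality
)
print(out["choices"][0]["text"])
```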

However, it is very slow: I only get 1.6 tokens/s, so it is not really usable due to the delays.

I did a quick test last night with an RTX 6000, which allows this model to be fully loaded into VRAM, and I was getting around 25 tokens/s. In my opinion, this model requires more power than a standard gaming PC has.
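If you want to check the tokens/s figure on your own hardware, here is a self-contained timing sketch (same placeholder path and settings as above):

```python
import time

from llama_cpp import Llama

llm = Llama(model_path="./model-IQ1_M.gguf", n_gpu_layers=36, n_ctx=4096)

start = time.time()
out = llm("Explain quantization in one paragraph.", max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} tokens/s")
```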

Thanks @tsalvoch for sharing your setup; that's a big help to others.

@MaziyarPanahi Just curious, is there a quant worth running on dual 3090s? Preferably fully loaded into VRAM with 8-32k context (the highest possible)... or would you need at least three 3090s to run a quant that's worth it?

@Adzeiros It depends on the model, to be honest. If we are talking about this specific model, it's pretty huge! So even a Q3 would do a good job, and it's still pretty beefy. I've seen people using two 3090s with some offloading, and they were happy with it for their use cases (proofreading, rewriting, etc.).
Based on the size of the model and the tasks you require, it should be possible to trade off and find a middle ground with two or three 3090s.
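To put rough numbers on that, here is the same back-of-the-envelope size formula applied to two and three 3090s; the 141B parameter count and the bits-per-weight figures are illustrative assumptions, not this model's exact values:

```python
# size_GB ~= params_in_billions * bits_per_weight / 8 (rough estimate).
QUANT_BPW = {"IQ1_M": 1.75, "IQ2_XS": 2.3, "Q3_K_M": 3.9, "Q4_K_M": 4.8}

for n_cards in (2, 3):
    usable = n_cards * 24 - 4  # reserve ~4 GB for KV cache / runtime buffers
    fits = [q for q, bpw in QUANT_BPW.items() if 141 * bpw / 8 <= usable]
    print(f"{n_cards}x RTX 3090 (~{usable} GB usable): {fits}")
```

Keep in mind that the longer the context, the more VRAM the KV cache consumes, so the 8-32k range mentioned above pushes you toward the smaller quants.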
