RTX 3070, only getting about 0.38 tokens/minute
I've played with the parameters a bit, but even when using:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 7 --pre_layer 19
it's still really slow. I know my card only has 8 GB of VRAM, and I've fixed the out-of-VRAM problem, but it still seems slow no matter what I do.
I don't know if this is relevant, but my general specs are:
Ryzen 9 3900X
16GB DDR4 RAM
RTX 3070 8GB
Whoa, I get around 8 tokens/s with a 3060 12GB.
You have set --pre_layer to 19, which basically puts part of your model in GPU VRAM and the rest in CPU RAM. Communication between VRAM and CPU RAM is much slower. Not sure if this is the only reason; also check that you installed the latest/fastest CUDA and PyTorch versions.
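A quick way to check which PyTorch/CUDA build you actually have (assuming a standard PyTorch install; this prints the PyTorch version, the CUDA version it was built against, and whether the GPU is visible at all):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"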
I also have a video card with only 8GB of VRAM; the model itself fits in the card's memory, but there isn't enough room left for inference. Once you put some layers of the model on the CPU, it becomes super slow. I'm very disappointed that they only put 8GB on a 3070.
This model runs pretty well on a 3080 and fits into its 10GB of VRAM. If you have another NVIDIA card, you might be able to use the VRAM on both cards.
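If I remember the webui flags correctly, --gpu-memory accepts one value per card, so a two-GPU split would look something like the line below (the 8 and 6 are hypothetical per-card caps in GiB for a primary and secondary card; adjust to your hardware):

call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 8 6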
Yes, I have the same problem. The GPU is only being used at 2%.
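For anyone debugging this, you can watch utilization live while the model is generating (nvidia-smi ships with the NVIDIA driver; the 1-second polling interval here is just an example):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

If GPU utilization stays in the low single digits during generation, the bottleneck is almost certainly the layers offloaded to the CPU rather than the card itself.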