Has anyone been able to use GPU and CPU more fully for higher speed output?
#9 · by gschadow · opened
An AWS g4dn.metal instance has 8 GPUs plus 96 CPU cores. However, running llama.cpp with the 70B q4 model and monitoring with nvidia-smi, I see my GPUs are only about 35% utilized (and even less with the smaller models).

I also notice that if I leave -t unspecified, llama.cpp uses 96 threads, and this actually slows things down drastically. I found that -t 4 is about as good as it gets, producing 8 tokens per second, but that leaves me with 92 CPU cores I can't use yet pay dearly for!

Any idea how we can use these resources more fully, or what causes the apparent contention at high CPU thread counts? And why can't we push the GPUs to at least 80% utilization? I would hope that 24 tokens per second should be possible with this kind of hardware.
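For reference, here is roughly how I'm invoking it. This is a sketch, not my exact command line: the model path is a placeholder, and flags like --n-gpu-layers and --tensor-split may be named or behave differently depending on the llama.cpp build you have:

```shell
# Hypothetical invocation sketch for a g4dn.metal (8x T4) box.
# -t 4              : 4 CPU threads (more than this slowed generation for me)
# -ngl 99           : try to offload all layers to the GPUs
# --tensor-split    : spread layers across the 8 cards (equal weights here)
./main \
  -m ./models/llama-70b.q4_0.gguf \
  -t 4 \
  -ngl 99 \
  --tensor-split 1,1,1,1,1,1,1,1 \
  -p "Hello"
```

Even with all layers offloaded like this, per-GPU utilization stays low for me, which is what makes me suspect the bottleneck is synchronization or transfer between the cards rather than raw compute.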