Out of memory error
Out of memory error even with 2 3060s
I'm running the model on an Nvidia P5000 with 16 GB; it fits in VRAM just fine. Response time can be as low as 5 s and averages 20 s or so once the context fills up. Even in the worst case, after dozens and dozens of chats, I can get a response in under 90 seconds. If I can do this on an old second-hand Pascal-generation GPU, then you should have no problem.
Which model runner are you using? Ollama? Kobold? llama.cpp directly? There's no hint about your environment. Some runners support multi-GPU and some don't, so it matters which one you use.
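For example, if it turns out to be llama.cpp through its Python bindings (an assumption, since the post doesn't say), splitting the model across both 3060s is an explicit setting, not something that happens automatically. A minimal sketch, with a placeholder model path:

```python
# Sketch only: assumes llama-cpp-python built with CUDA support and a local
# GGUF file at ./model.Q4_K_M.gguf (hypothetical path).
from llama_cpp import Llama

llm = Llama(
    model_path="./model.Q4_K_M.gguf",
    n_gpu_layers=-1,          # offload every layer to GPU instead of CPU
    tensor_split=[0.5, 0.5],  # split the weights evenly across the two 3060s
    n_ctx=4096,               # context window; larger values need more VRAM
)

print(llm("Q: What GPU am I on?\nA:", max_tokens=32)["choices"][0]["text"])
```

Other runners have their own knobs for the same thing (Ollama and Kobold handle splitting differently), which is exactly why it matters which one you're on.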
Are you running a full fp16 (2 bytes per weight) model, or a quantized version like Q8_0? Have you considered a Q4 variant, which roughly halves memory use compared to Q8_0 and cuts it to about a quarter of fp16?
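Back-of-the-envelope math makes the difference concrete. The sketch below just multiplies parameter count by bits per weight, using a 13B model as an example; it ignores the KV cache and runtime overhead, so treat it as a rough lower bound:

```python
# Rough VRAM estimate for the weights alone, ignoring KV cache and overhead.
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for name, bits in [("fp16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"13B model at {name}: ~{weight_gib(13, bits):.1f} GiB")
# fp16 at ~24 GiB won't fit in 16 GB of VRAM; Q4 at ~7 GiB fits comfortably.
```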
Are you sure the model runner is actually using the GPUs?
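The quickest way to answer that is to watch VRAM while the model loads. A minimal sketch, assuming `nvidia-smi` is on the PATH:

```python
# Poll per-GPU memory usage via nvidia-smi; if the numbers barely move while
# the model loads, the runner is not actually using the GPUs.
import subprocess, time

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

for _ in range(10):
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for line in out.strip().splitlines():
        idx, used, total = (s.strip() for s in line.split(","))
        print(f"GPU {idx}: {used} / {total} MiB")
    time.sleep(5)
```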
Is it a Docker container? Docker has to pass the GPU through into the container; without an explicit GPU request in the compose file, the container gets no device pass-through at all.
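In compose that's the `deploy.resources.reservations.devices` block (and the NVIDIA Container Toolkit on the host). If you start the container from Python instead, the same rule applies: nothing is passed through unless you ask. A sketch using Docker's Python SDK, with `ollama/ollama` as an example image only:

```python
# Sketch: request all GPUs explicitly when starting a container through the
# Docker SDK. Requires the NVIDIA Container Toolkit on the host.
import docker

client = docker.from_env()
container = client.containers.run(
    "ollama/ollama",  # example image, swap in whatever you actually run
    detach=True,
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]),  # -1 = all GPUs
    ],
)
print(container.id)
```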
How much system RAM do you have? If the runner falls back to CPU, the model will usually be loaded into system memory and fill it up.
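An easy way to catch that is to compare available system RAM before and after the model loads. A minimal sketch, assuming `psutil` is installed (`pip install psutil`):

```python
# Snapshot system RAM before and after loading the model; a drop of roughly
# the model's size means the weights landed in system memory, not VRAM.
import psutil

def ram_gib() -> float:
    return psutil.virtual_memory().available / 1024**3

before = ram_gib()
input("Load the model now, then press Enter...")
after = ram_gib()
print(f"Available RAM went from {before:.1f} GiB to {after:.1f} GiB "
      f"(drop of {before - after:.1f} GiB)")
```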