Hardware recommendations
First of all, I want to say thank you for all your efforts. Could you also provide some advice on the recommended hardware specifications needed to run models of this size effectively?
According to the model card, 4_0 should fit all layers into a 3090.
I was going to download 5_1 and try that, offloading some layers to RAM. I haven't tried it yet, though, so maybe I'm wrong in my thinking.
Thank you for your reply. When you say it should fit all layers, what does that mean? Can I expect response times faster than 100 tokens per second? And is it possible to run it on multiple graphics cards?
Most of the models we use have 40 layers, so loading the entire model into VRAM greatly speeds up inference. When you load fewer than 40 layers (less than the entire model), the remaining portion is loaded into RAM and a slowdown inevitably happens. Here's a link that explains layers, contexts, inference, etc.: https://kipp.ly/blog/transformer-inference-arithmetic/
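If you want to experiment with this, here's a minimal sketch of partial offloading using llama-cpp-python, assuming a CUDA-enabled build; the filename and layer count are placeholders:

```python
from llama_cpp import Llama

# Offload 35 of the model's 40 layers to VRAM; the other 5 stay in RAM.
# n_gpu_layers=-1 would offload every layer (the "all layers fit" case).
llm = Llama(
    model_path="./model.Q5_1.gguf",  # placeholder filename
    n_gpu_layers=35,
    n_ctx=2048,
)

out = llm("Q: What does a transformer layer compute? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The fewer layers you can fit on the card, the more of each forward pass runs on the CPU, which is where the slowdown comes from.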
I do not know of any consumer setup (graphics cards) that can produce 100 t/s.
Here I've noticed a nice improvement when combining CPU+GPU to get ~10 t/s. Prior to this, on pure CPU I was getting 5-6 t/s.
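If you want to check your own numbers, here's a rough way to measure it with llama-cpp-python (the model path and prompt are placeholders; this times prompt processing and generation together, so it slightly understates pure generation speed):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="./model.Q4_0.gguf", n_gpu_layers=-1)  # placeholder path

start = time.perf_counter()
out = llm("Explain layer offloading in one paragraph.", max_tokens=128)
elapsed = time.perf_counter() - start

# The completion dict reports how many tokens were actually generated.
generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} t/s")
```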
You can load the model across multiple graphics cards, but inference will only happen on one as far as I understand it.
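For reference, here's a minimal sketch of spreading the layers across two cards with llama-cpp-python's tensor_split parameter (the filename and the 60/40 ratio are placeholders). As noted above, this mainly lets a larger model fit; it doesn't make both cards compute at once:

```python
from llama_cpp import Llama

# Put ~60% of the layers on GPU 0 and ~40% on GPU 1.
# n_gpu_layers=-1 offloads all layers, split across the two cards.
llm = Llama(
    model_path="./model.Q4_0.gguf",  # placeholder filename
    n_gpu_layers=-1,
    tensor_split=[0.6, 0.4],
)
```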