Max Context?
Hi, I'm using the Q4_K_M on Backyard. Just wondering what kind of context I can set it to?
Excellent model by the way, the first that passed every one of my test questions!
This model is based on Llama 3.1 70B, so you could go up to a 128,000-token context, but the amount of RAM (or GPU memory, if you run it on GPUs) required to do so would be absolutely massive. The fact that you went with i1-Q4_K_M suggests you don't have enough RAM/VRAM to run any larger quants, which probably doesn't leave much room for context either. Just set the context size to what you actually need, and if you run out of memory, go with a smaller quant if your workload really requires such a large context. If you run the model entirely on Ampere or Ada Lovelace NVIDIA GPUs, you can enable Flash Attention 2, which makes context memory usage scale linearly with context length instead of quadratically, so it uses less memory.
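For a rough sense of why large contexts get expensive, here's a back-of-the-envelope KV-cache estimate in Python. It uses the published Llama 3.1 70B architecture values (80 layers, 8 KV heads via GQA, head dim 128) and assumes an unquantized fp16 cache; actual memory use in any given runtime will differ, since weights, activations, and framework overhead come on top of this:

```python
# Rough KV-cache size estimate for Llama 3.1 70B.
# Config values from the published model card: 80 layers,
# 8 key/value heads (GQA), head dimension 128.
# Assumes an fp16 cache (2 bytes/element); this is a sketch,
# not a measurement of any specific backend.

def kv_cache_bytes(n_tokens: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for the K and V caches combined (factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

for ctx in (8_192, 16_384, 20_480, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.2f} GiB KV cache")
```

Under these assumptions that works out to roughly 2.5 GiB at 8K, 5 GiB at 16K, 6.25 GiB at 20K, and around 40 GiB at the full 128K, which is why the full context only makes sense with lots of memory or a quantized/offloaded cache.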
I have a 3090 with 24GB VRAM and 64GB RAM. I'm using Backyard, which has a max preset of 8K, or you can enter a custom size. It's running fine at the moment at 16K... If the model supports something massive like 128K, then I'll just see what my computer can handle. I'm hoping for 20K...
It runs, but slowly: 0.98 tps.
OK, thank you :)