Is the KV cache of these models unusually high?
#6 by Hugsanir
I noticed that for a 2048-token context window, llama.cpp allocates 2560 MiB for the KV cache, which seems extraordinarily high. This is without any KV cache quantization.
Here is a table I threw together with various models and their KV cache sizes at a context size of 2048. They are all GGUF-quantized to varying degrees, but that doesn't seem to make a difference. Try to spot the outlier 😁
Model | Params | KV total | Keys | Values |
---|---|---|---|---|
Mixtral-8x7B-Holodeck-v1 | (48B) | 256 MiB | 128 MiB | 128 MiB |
Meta-Llama-3-8B-Instruct | (8B) | 256 MiB | 128 MiB | 128 MiB |
Meta-Llama-3-70B-Instruct | (70B) | 640 MiB | 320 MiB | 320 MiB |
Qwen1.5-32B-Chat | (32B) | 512 MiB | 256 MiB | 256 MiB |
Yi-1.5-34B-Chat | (34B) | 480 MiB | 240 MiB | 240 MiB |
functionary-small-v2.4 | (7B) | 256 MiB | 128 MiB | 128 MiB |
c4ai-command-r-v01 | (35B) | 2560 MiB | 1280 MiB | 1280 MiB |
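
For reference, the numbers above line up with the usual back-of-envelope formula: KV cache size ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × context length × bytes per element. Here is a quick sketch in Python; the per-model hyperparameters are my reading of the respective configs, so treat them as assumptions rather than gospel:

```python
# Rough KV cache estimate, assuming an fp16 cache (2 bytes per element) and the
# usual layout of one K tensor and one V tensor per layer.
def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * n_ctx / (1024 ** 2)

# Hyperparameters below are assumptions pulled from the models' configs.
print(kv_cache_mib(n_layers=32, n_kv_heads=8,  head_dim=128, n_ctx=2048))  # Llama-3-8B    -> 256.0
print(kv_cache_mib(n_layers=80, n_kv_heads=8,  head_dim=128, n_ctx=2048))  # Llama-3-70B   -> 640.0
print(kv_cache_mib(n_layers=40, n_kv_heads=64, head_dim=128, n_ctx=2048))  # command-r-v01 -> 2560.0
```

If those configs are right, the gap seems to come down to the number of KV heads per layer: the Llama-3 models use grouped-query attention with 8 KV heads, while c4ai-command-r-v01 appears to use full multi-head attention with 64 KV heads, so its cache is roughly 8× larger per layer.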