[llama.cpp PR#7527] GGUF Quantized KV Support

#15 opened by Lewdiculous
AetherArchitectural org

Related PRs:
https://github.com/ggerganov/llama.cpp/pull/7527
https://github.com/ggerganov/llama.cpp/pull/7681
https://github.com/ggerganov/llama.cpp/pull/7412

Available in the KoboldCpp builds from Nexesenex:
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.67b_b3066

The legend as always, providing it for the thirsty early adopters.

This is actually so huge, honestly I can almost double my --contextsize now. It's a straight up +50% boost at least.
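A rough back-of-the-envelope for why a quantized cache nearly doubles usable context, sketched in Python. The Llama-3-8B-style dimensions and the per-element sizes for ggml's f16/q8_0/q4_0 cache types are my assumptions, not figures from the PR:

```python
# Rough KV-cache size estimate. Assumed dims (not from the thread):
# Llama-3-8B-style GQA: 32 layers, 8 KV heads, head_dim 128.
# Assumed ggml storage costs: f16 = 2 B/elem, q8_0 = 34 bytes per 32-value
# block (~1.06 B/elem), q4_0 = 18 bytes per block (~0.56 B/elem).

BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, cache_type="f16"):
    """Bytes used by the K and V caches of a dense GQA model at n_ctx tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return n_ctx * per_token

for t in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(8192, cache_type=t) / 2**30
    print(f"{t:>5}: {gib:.2f} GiB for 8192 ctx")
# f16 ~ 1.00 GiB, q8_0 ~ 0.53 GiB, q4_0 ~ 0.28 GiB
```

In the same ~1 GiB budget, q8_0 fits roughly 1.9x the context of f16, which lines up with the "almost double my --contextsize" observation above.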

Lewdiculous changed discussion title from [llama.cpp PR#7527] Quantized KV Support to [llama.cpp PR#7527] GGUF Quantized KV Support
AetherArchitectural org

This thread is for discussions, testing, sharing results, questions, issues, coping, dreams... Anything goes.

AetherArchitectural org

For me, right now, as soon as the context is full and Context Shifting triggers, it crashes.

[Context Shifting: Erased 140 tokens at position 1636]GGML_ASSERT: U:\GitHub\kobold.cpp\ggml-cuda\rope.cu:255: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16
<CRASH>

But very promising for this stage of the implementation.
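A note on that assert, for anyone curious: Context Shifting erases a span of tokens and then slides the later cache entries back to new positions, which requires re-applying a rotary (RoPE) position correction to the cached K tensors, and the CUDA RoPE kernel only accepts F32/F16 tensors, so a quantized K cache trips it. Below is a minimal conceptual sketch in Python of the two steps involved; it is not KoboldCpp's or llama.cpp's actual code, and `rope_shift` is just a stand-in for the real kernel:

```python
# Toy model of what context shifting does to the KV cache.
# Not the real implementation - just the two steps that matter here.

def rope_shift(k_entry, delta):
    # Stand-in for the CUDA RoPE kernel: in the real code this is where
    # GGML_ASSERT(src0->type == F32 || F16) fires if K is stored quantized.
    return {**k_entry, "pos": k_entry["pos"] + delta}

def context_shift(kv_cache, erase_start, erase_len):
    """Erase a token span, then slide later entries back by erase_len."""
    kept = [e for e in kv_cache
            if not (erase_start <= e["pos"] < erase_start + erase_len)]
    return [rope_shift(e, -erase_len) if e["pos"] >= erase_start + erase_len else e
            for e in kept]

cache = [{"pos": p, "k": None} for p in range(2000)]
cache = context_shift(cache, erase_start=1636, erase_len=140)  # mirrors the log line above
print(len(cache), cache[-1]["pos"])  # 1860 entries left, last position now 1859
```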

Only the 3063 q8_0 build has passed all tests flawlessly.

This will be the most stable build and may fix the crash issue:
https://github.com/Nexesenex/kobold.cpp/releases/download/v1.67b_b3066/koboldcpp_cuda_12.2_K8_V51.exe

AetherArchitectural org

This is actually the one I was already using to test. Have you successfully Context Shifted? I tested with --contextsize 6144 in an existing conversation that was about 8K long.

> This is actually the one I was already using to test. Have you successfully Context Shifted? I tested with --contextsize 6144 in an existing conversation that was about 8K long.

CtxLimit: 58/8192, Process:0.43s (10.2ms/T = 97.67T/s), Generate:3.01s (188.2ms/T = 5.31T/s), Total:3.44s (4.65T/s)
CtxLimit: 8192/8192, Process:34.21s (4.4ms/T = 227.17T/s), Generate:185.17s (440.9ms/T = 2.27T/s), Total:219.37s (1.91T/s)
[Context Shifting: Erased 420 tokens at position 2]GGML_ASSERT: ......\ggml.c:14700: false
(the same GGML_ASSERT line repeats about ten more times)

Yeah, same for me, getting this whenever it shifts.
It's more than a welcome update in any case, and I'm sure they'll fix this soon.

[Context Shifting: Erased 165 tokens at position 1686]
Processing Prompt (24 / 24 tokens)GGML_ASSERT: U:\GitHub\kobold.cpp\ggml-cuda\rope.cu:255: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16

[process exited with code 3221226505 (0xc0000409)]

All I can find for that error code is memory corruption.
I tried K8 V5_1 and KV5_1 with no luck, and pretty much every setting in the GUI.
Also tried the old SmartContext; it also dies.

Just to experiment, this is Llama-3-8B-Q6_K @ 16K (almost) with a LLaVA mmproj loaded.
[screenshots]

Yi-9B-32K-Q5_K_M @ 32K

[screenshot]

AetherArchitectural org

Yi really is super skinny, huh?

> Yi really is super skinny, huh?

Yi's context is tiny, it's magic 😭
I've been waiting for a Yi-9B RP model for a while; it's really smart and has better reasoning than most models I've tried in instruct.
Plus it has a native 16K chat version. Something that could actually be useful if KV quantization becomes stable enough.

Also, with some testing, there's about a 1/3 reduction in generation speed with KV quantization:
35 T/s -> 25 T/s
Can't really tell the difference though, I can't read that fast.

> Also, with some testing, there's about a 1/3 reduction in generation speed with KV quantization: 35 T/s -> 25 T/s. Can't really tell the difference though, I can't read that fast.

The upcoming llama.cpp fork has a multi-threaded 👀 update - read a different word with each eye at the same time for 2x token reading speed.

> Also, with some testing, there's about a 1/3 reduction in generation speed with KV quantization: 35 T/s -> 25 T/s. Can't really tell the difference though, I can't read that fast.
>
> The upcoming llama.cpp fork has a multi-threaded 👀 update - read a different word with each eye at the same time for 2x token reading speed.

That would be super helpful for higher context while using KV quanting; past 32K context it takes forever to read in the tokens.
With KV quant, 0 -> 14K ctx averages 600 T/s ingestion.
Without KV quant, 0 -> 14K averages 1100 T/s ingestion.
I imagine the difference would be quite noticeable when ingesting 32-64K ctx.
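Taking those two averages at face value and extrapolating naively to longer prompts (a quick sketch; the 600 and 1100 T/s figures are just the numbers quoted above, and real throughput would vary with context length):

```python
# Naive extrapolation of the ingestion rates quoted above.
for n_ctx in (14_000, 32_000, 64_000):
    with_quant = n_ctx / 600   # seconds at ~600 T/s (quantized KV)
    without = n_ctx / 1100     # seconds at ~1100 T/s (f16 KV)
    print(f"{n_ctx:>6} ctx: ~{with_quant:.0f}s quantized vs ~{without:.0f}s f16 "
          f"(+{with_quant - without:.0f}s)")
# 14000 ctx: ~23s vs ~13s; 32000 ctx: ~53s vs ~29s; 64000 ctx: ~107s vs ~58s
```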

AetherArchitectural org

KoboldCpp 1.67:

You can now utilize the Quantized KV Cache feature in KoboldCpp with --quantkv [level], where level 0=f16, 1=q8, 2=q4. Note that quantized KV cache is only available if --flashattention is used, and is NOT compatible with Context Shifting, which will be disabled if --quantkv is used.

Context Shifting please come home...

Quantized KV cache + Qwen2's context size sorcery is big
Qwen2-7B-Q5_K_M @ 64K ctx & 8 bit cache
[screenshot]
I'd be concerned if you needed context shifting + 64K ctx 😭
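A rough sketch of why the "sorcery" works: Qwen2-7B uses aggressive GQA, so its KV cache per token is small to begin with, and the 8-bit cache shrinks it further. The 28 layers / 4 KV heads / head_dim 128 figures below are my assumed dimensions for Qwen2-7B, not something stated in the thread, plugged into the same per-token formula as the sketch earlier:

```python
# KV-cache size for assumed Qwen2-7B dims: 28 layers, 4 KV heads, head_dim 128.
def qwen2_kv_gib(n_ctx, bytes_per_elem):
    return 2 * 28 * 4 * 128 * bytes_per_elem * n_ctx / 2**30

print(f"64K ctx, f16 cache : {qwen2_kv_gib(65536, 2.0):.2f} GiB")    # ~3.50 GiB
print(f"64K ctx, q8_0 cache: {qwen2_kv_gib(65536, 34/32):.2f} GiB")  # ~1.86 GiB
```

So the cache alone drops from roughly 3.5 GiB to under 2 GiB at 64K, which is what makes squeezing the Q5_K_M weights alongside it into 8GB look plausible.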

AetherArchitectural org

> I'd be concerned if you needed context shifting + 64K ctx 😭

LMAO at that point... Yeah, your ERP has gone too far xD

Honestly that's crazy! Qwen2 Q5 at 64K in only 8GB of VRAM?!!

Are there any prominent RP tunes/merges or are you using the original?

> > I'd be concerned if you needed context shifting + 64K ctx 😭
>
> LMAO at that point... Yeah, your ERP has gone too far xD
>
> Honestly that's crazy! Qwen2 Q5 at 64K in only 8GB of VRAM?!!
>
> Are there any prominent RP tunes/merges or are you using the original?

I tried the dolphin version, it's uh, interesting?

[screenshots]

I'm excited to see Qwen2; however, we really need an uncensored RP tune, as it's severely censored as expected (coming from China). But CodeQwen1.5-7B kills it for code, so those guys know how to make efficient models. Qwen censorship in general is comical, though :D

It refused to give me a response on safety tips for masturbation because "The request you're making involves activities that can be harmful and potentially illegal. Safety, legality, and ethical considerations are important factors that I must adhere to when providing assistance."

However, they note on the model page that it's good for RP, and I did try it with some of the newer RPG-formatted cards I am working with, and it's doing better than I expected. It's going along with the ERP and doing a pretty good job of applying the formatting properly. I would for sure be interested in ERP fine-tunes of it. I could run it with 32K context no problem on Q6, but it was running slow. With 8K context at Q6 it just flies.

> I'm excited to see Qwen2; however, we really need an uncensored RP tune, as it's severely censored as expected (coming from China). But CodeQwen1.5-7B kills it for code, so those guys know how to make efficient models. Qwen censorship in general is comical, though :D
>
> It refused to give me a response on safety tips for masturbation because "The request you're making involves activities that can be harmful and potentially illegal. Safety, legality, and ethical considerations are important factors that I must adhere to when providing assistance."

I decided to try the base Qwen2-7B to see what the censorship is like:

From cannibalism:
[screenshot]
To censorship:
[screenshot]
It's not doing well 😭

Lmao, this question with the photography really messes them up badly :D SOLAR also gets it wrong 😭

The riddle shows just how impressive Yi-9B is; it can answer the question right 10/10 times.
Plus it can manage the weight questions (kg of feathers vs. lb of steel).
There's a big lack of Yi 1.5 RP models, yet it's so smart and has native 16K @_@

Ah, yeah, the Yi one seems interesting. It would be nice to see more RP tunes in that 9-30B range.

> Ah, yeah, the Yi one seems interesting. It would be nice to see more RP tunes in that 9-30B range.

Seeing the performance of the Yi-34Bs makes a 24GB GPU so tempting; the reasoning seems better than Llama 3 70B from playing around with them in the LMSYS arena.
And Q4 @ 16K ctx would fit in VRAM; with cache quanting, Q5 might even be possible in VRAM.
