I think this is a promising model, but it deteriorates into broken English quite fast.
Make sure you use the settings/presets from the model page and/or a high enough quant. I've done ~200 messages so far with Q5_K_L and 64k context without any issue like this. I've only ever seen stuff like that with either too low a quant (I personally draw the line at iQ4_XS) or messed-up settings (e.g. v0.0 of this model was completely destroyed by using min P, whereas in v0.1 min P works and is recommended in the model page settings).
edit: completely forgot about KV cache quantization; tl;dr: just don't use it with Qwen.
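Since min P came up: roughly, min P drops every token whose probability is below min_p times the top token's probability, so a value that's too aggressive for a given model can gut the distribution. A toy sketch of the filter (my own illustration, not the actual backend code):

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Toy min-P filter: keep only tokens whose probability is at least
    min_p * (probability of the most likely token), then renormalize."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# With a flat-ish toy distribution, min_p=0.1 already prunes the tail token.
logits = np.array([2.0, 1.8, 1.5, 0.5, -1.0])
print(min_p_filter(logits, min_p=0.1))
```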
I've used the settings from the model page with a 6bpw quant (so quite high) and still ran into this behavior after only 10 messages. I don't use any cache quantization for such small models.
I did not use the context and system prompt templates as I prefer my own, but so far I've never had a model where that was an issue.
Is your instruct in ChatML format? If it is, just for completeness' sake, still try the instruct + context from the model page; it's two clicks to import them, and if you still get the problem, the issue is somewhere else. I have plenty of models that require a specific instruct, and all of them need at least the basic prompt format (ChatML/Mistral/Llama/Alpaca/Vicuna etc.) and sequence tokens set right; I then add my own prompts and instructs on top of that. Depending on what prompt format a model was trained on, I have at least 3-4 presets I swap between quite often. So unless all the models you use share the same prompt format, you gotta switch them around to get the most out of each model. Llama gets borderline stoopid with the wrong prompt/instruct format. Just my 2 cents.
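For reference, a ChatML prompt with the sequence tokens in place looks roughly like this; a hand-rolled sketch of the turn structure, not the exact template the frontend exports:

```python
# Rough sketch of how a ChatML prompt is assembled; the context/instruct
# templates from the model page layer more on top of this basic structure.
def chatml_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for role, text in turns:  # role is "user" or "assistant"
        parts.append(f"<|im_start|>{role}\n{text}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "\n".join(parts)

print(chatml_prompt("You are {{char}}, a helpful roleplay partner.",
                    [("user", "Hi there!")]))
```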
Okay, now I've imported your context.json and instruct.json and it works much better. Although I have to say, your system prompt is pretty bad and results in boring one-line answers. I recommend you update it, unless that's what you prefer, of course. ;)
But otherwise it seems to work properly. I probably had something wrong somewhere in my instruct sequence. I can give this model a proper try now. Thanks for your help!
Glad you got it working. I can't post an example because my chats are either too personal or godless, so you'll have to take my word for it, but the replies I'm getting are usually 400-500 tokens, structured like this, which is what I'm used to:
paragraph of fluff
"1-3 lines of dialogue"
paragraph of fluff
"1-3 lines of dialogue"
paragraph of fluff
I dislike changing prompts in the middle of a chat since it can flip the personality of the char, but maybe I'll explore a different prompt once I start wooing the next tsundere catgirl.
Interesting, because with your system prompt it always put out one-liners, and regenerations would actually result in the exact same text:
Output generated in 11.70 seconds (2.65 tokens/s, 31 tokens, context 815, seed 1840390845)
Output generated in 1.18 seconds (26.35 tokens/s, 31 tokens, context 815, seed 1357878541)
Output generated in 1.31 seconds (23.67 tokens/s, 31 tokens, context 815, seed 638079673)
Output generated in 1.28 seconds (24.25 tokens/s, 31 tokens, context 815, seed 1110667178)
Output generated in 1.28 seconds (24.22 tokens/s, 31 tokens, context 815, seed 995965318)
Output generated in 1.06 seconds (29.24 tokens/s, 31 tokens, context 815, seed 1944607958)
And when I changed back to my original system prompt it became much more talkative:
Output generated in 3.99 seconds (38.34 tokens/s, 153 tokens, context 1378, seed 2116581699)
Output generated in 4.60 seconds (41.52 tokens/s, 191 tokens, context 1446, seed 804217911)
Output generated in 4.16 seconds (40.60 tokens/s, 169 tokens, context 1674, seed 1583588257)
Output generated in 3.74 seconds (39.55 tokens/s, 148 tokens, context 1225, seed 1958565187)
Output generated in 4.52 seconds (40.29 tokens/s, 182 tokens, context 1481, seed 228422603)
Output generated in 4.35 seconds (34.73 tokens/s, 151 tokens, context 2313, seed 1968471248)
Output generated in 4.61 seconds (39.89 tokens/s, 184 tokens, context 2420, seed 602149675)
Output generated in 4.39 seconds (38.72 tokens/s, 170 tokens, context 1979, seed 967732612)
Weird, maybe it has something to do with the underlying Python library being different, but I don't know. At least the model now works properly without those glitches. :D
This could be due to a different backend and/or quant architecture, or maybe the sampler parameters. I've gathered anecdotal, non-reproducible evidence of the same model, same quant, and same prompts producing vastly different output quality depending on the backend, but every time I bring it up it devolves into "ooba is better than kobold" or vice versa, which is why I don't like bringing it up. Personally I'm using kobold coz I often share with Horde.
My sampler settings, for completeness' sake:
I don't have XTC since I'm not on the staging branch, I'm on Ooba, and I use EXL2 quants rather than GGUF; otherwise I've used the parameters you listed on the model page.
XTC has been merged into stable SillyTavern 1.12.6.
Only for Kobold. It's still in staging for Ooba.
@asdfsdfssddf Thanks for the help with the sampler settings. @Jellon Instead of EXL2 I actually switched to Aphrodite-Engine's on-the-fly FP6/FPx quants (depending on your local hardware), as they have been giving me the best results so far short of just hosting FP16 (not BF16).
I'm super happy with EXL2, honestly. Really no complaints; it's fast and good quality for the size.
@Jellon It's my go-to non-production quant for sure. One thing I can think of: "technically" EXL2 is non-deterministic even with do_sample=False. I can't find the discussion on it in the EXL2 repo right now, but it comes down to tiny errors adding up over time and causing, in a sense, token flips. Not sure if this model induces it more, but it could in theory be the cause.
Could you elaborate on what you mean by tiny errors over time? What timeframe are we talking about? Non-deterministic is usually a positive in my book; I generally don't mind if a certain seed doesn't produce the same result with the same settings.
@Jellon I actually found the GitHub issue where turboderp discusses this: https://github.com/turboderp/exllamav2/issues/232#issuecomment-1860896496 Now, I don't think this is what's happening here, but it's in theory possible.
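To illustrate the idea with a toy example (plain numpy, nothing to do with the actual EXL2 kernels): when two candidate tokens are nearly tied, an error on the order of the accumulated rounding drift is enough to flip which token greedy decoding picks, and everything generated afterwards is conditioned on the flipped token.

```python
import numpy as np

# Two near-tied logits plus a tiny perturbation standing in for
# accumulated floating-point error in the forward pass.
logits = np.array([10.000001, 10.000000, 3.0])  # token 0 barely wins
drift  = np.array([-2e-6,      0.0,      0.0])

print(np.argmax(logits))          # 0
print(np.argmax(logits + drift))  # 1 -> the greedy pick flips, and the
                                  #      rest of the generation diverges
```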