Best 7b mistral I've used so far.

#8
by Crowno - opened

Ok, right now I'm stuck with 12GB of VRAM. Before I found this model I was torn between using 13Bs for better coherency but insanely low context (4K is just not enough for me), or 7Bs for larger context sizes and faster responses but way less coherency and intelligence... it was quite a frustrating situation.

Then I found a 7B Mistral model named Dolphin, which eased my problems a little for quite some time, but it still fell short: a lot of the time it was hit or miss at following character cards and instructions.

Some user in the SillyTavern Discord mentioned this model, so I decided to give it a try and man, this was like the holy grail I'd been looking for since I started digging into the LLM world many months ago (I came from the early dark times of Llama 1 models with 2K context; long gone are those dark days...). Silicon Maid not only follows everything in the character card beautifully, it also takes into account whatever I throw into the Author's Note (if set at depth 1, it follows the instructions there 9 times out of 10, if not always).

I've been using the LoneStriker 8bpw exl2 quant of this model for best accuracy and have since ditched pretty much all my other models. I prefer this even over 13B models just because of how well it follows what I tell it to do; it doesn't matter how big or complex a model is, if it's bad at following instructions then it serves no purpose.

I've been using this with 12K context with no issues, even in very long chats with the full 12K context window. You have my gratitude for making such a lightweight, RP-useful model a reality.

Have you found a better one?

Well, since I wrote this review a few months ago I've switched to Kunoichi DPO v2. At first I wasn't satisfied with its performance, but that was because the settings I was using with Silicon Maid were not ideal for Kunoichi. This is the quant I've been using: https://huggingface.co/Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2

I've tried many other 7B models but always return to Kunoichi; it handles dynamic temp very well (I have DynaTemp set to 2.5 max, 0.5 minimum). Another alternative I've been testing recently is a Kunoichi merge known as Kunoichi Lemon Royale. I like how it performs because it has a little more word variety, but I can still feel Kunoichi's essence in it. This is the quant I'm using if you're interested: https://huggingface.co/grimjim/kunoichi-lemon-royale-7B-8.0bpw_h8_exl2
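
Since dynamic temp comes up a lot, here's a minimal sketch of the idea behind it as I understand it (my own illustration; the actual DynaTemp samplers in Ooba/kobold use their own mapping): the temperature is scaled between your minimum and maximum depending on how uncertain the model is about the next token.

```python
import math

def dynamic_temperature(probs, min_temp=0.5, max_temp=2.5):
    """Sketch of dynamic temperature: scale temperature with the
    normalized entropy of the next-token distribution.
    High entropy (model unsure)   -> hotter, more creative sampling.
    Low entropy (model confident) -> cooler, more deterministic."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))   # entropy of a uniform distribution
    uncertainty = entropy / max_entropy  # normalized to 0.0 .. 1.0
    return min_temp + (max_temp - min_temp) * uncertainty

# A confident distribution stays near min_temp:
print(dynamic_temperature([0.97, 0.01, 0.01, 0.01]))  # ~0.74
# A flat distribution gets the full max_temp:
print(dynamic_temperature([0.25, 0.25, 0.25, 0.25]))  # 2.5
```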

Lastly, the other model I tried recently was Tiamat, a Llama 3 8B model... I like that it can play evil characters better, but I don't use it often; it's biased towards spouting archaic old-English nonsense, and that's a deal breaker for me since that bias is quite hard to remove even with system prompts. I'd recommend trying either Kunoichi DPO v2 or the Kunoichi Lemon Royale merge; they do feel like a nice upgrade over Silicon Maid and are my two top favorites for now.

Thanks for the response! I'm still learning all this. What does it mean to use a quant? I just downloaded regular Kunoichi DPO v2 7B; what is the quant version? Have you run into cases where it denies a request no matter what you say to it? I was trying one out today, even one that was an "uncensored" LLM, and it still denied me because "it breaks community guidelines".

If the model you're using spits that community guidelines crap at you, then it is not properly uncensored; don't bother with those models, they're a pain to begin with.
Depending on what you want to do, you should look for models properly finetuned for roleplaying, coding, chat assistant use, etc.

On another note, model quantization is the process of, let's say, compressing the model weights so you can actually run them locally with less resource consumption than the full unquantized model.
There are many quant formats: GGUF, AWQ, GPTQ, EXL2, etc. I honestly only care about EXL2 quants; in my personal opinion they're the best optimized if you want to run fully on GPU, meaning loading the entire model/weights into your VRAM. (That depends on your hardware, though; some people run GGUF models so they can split the load between GPU and CPU using system RAM, either out of personal preference or hardware limitations.)

Quants are measured in bits per weight (BPW): the higher the bits, the more accurate (smarter) the model is but the more VRAM it consumes; the lower the bits, the less VRAM it consumes but the dumber the model might get.
So it depends a lot on your hardware: how much VRAM you have, which generation your GPU is (2000, 3000, 4000 series, etc.). In my case, since I have a pretty decent mid-tier GPU with 12GB, I've been running 8BPW 7B models at 16K context with no problems (only really practical with EXL2 quants; other formats like GPTQ are rarely offered at 8 bits, which is close to unquantized performance).
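
If it helps anyone run the numbers for their own card, here's the rough back-of-the-envelope math I mean (my own approximation, not an official calculator; real usage adds loader and activation overhead on top):

```python
def vram_estimate_gib(params_b, bpw, ctx_len=16384, n_layers=32,
                      n_kv_heads=8, head_dim=128, cache_bytes=2):
    """Rough VRAM estimate: quantized weights plus the KV cache.
    The defaults roughly match a Mistral-7B-style architecture
    (an assumption); cache_bytes=2 means an FP16 cache."""
    weights = params_b * 1e9 * bpw / 8 / 2**30             # GiB of weights
    kv_cache = (2 * n_layers * ctx_len * n_kv_heads        # K and V per layer
                * head_dim * cache_bytes) / 2**30
    return weights + kv_cache

# A 7B-class model at 8 bpw with 16K context:
print(round(vram_estimate_gib(7.24, 8.0), 1))                # ~8.7 GiB, fits in 12GB
# The same model at 4 bpw with 8K context, for a 6GB card:
print(round(vram_estimate_gib(7.24, 4.0, ctx_len=8192), 1))  # ~4.4 GiB
```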

I run all my local models in the Ooba WebUI with the ExLlama2_HF loader, then play around with the generation settings in SillyTavern, which is the frontend I use since these models are mostly for roleplay. If I have to do any serious real-life work stuff for my job, I'd rather use a corporate cloud-based model like ChatGPT, but for fun stuff and uncensored potential, local only.

Thanks for that explanation! I still have a lot to learn. I'm using Ooba also. With the info you gave me I feel more confident; I appreciate you helping me.

So I'm using the 4bpw one of https://huggingface.co/Kooten/Kunoichi-DPO-v2-7B-8bpw-exl2 because I have an old GTX 1060 with 6GB VRAM, and is it normal for it to say things like "Certainly, I can attempt to role play as Tiamat, the five-headed dragon goddess, for a limited period of time. It's important to keep in mind that this is a fictional scenario."? Now, I'm not using SillyTavern, not sure what that is exactly, but I'm looking into it now. Do you get dumb crap like that?

Can't say for sure; I never run models outside of SillyTavern as the frontend. In my case, all Ooba does is load the model with ExLlama2_HF and act as an API for SillyTavern. SillyTavern has its own instruct templates and generation parameters to drive the model into roleplay mode, since that's the main purpose of SillyTavern to begin with: you load individual character cards, each with its own set of parameters telling the AI how the character behaves. I never "ask" the AI what to do; SillyTavern's parameters should point the model in that direction. All you have to do is load a character card and roleplay, forgetting you're talking to an AI rather than the character itself.

You can join the ST Discord server, where many of us would gladly help you figure out SillyTavern better, if you're not there already: https://discord.gg/8fDp5AdJ
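
By the way, if you're curious what SillyTavern actually does under the hood, it just sends requests to the API Ooba exposes; something like this minimal sketch (assuming you launched Ooba with the --api flag, which serves an OpenAI-compatible endpoint on port 5000 by default; routes and fields can vary between versions):

```python
import requests

# Minimal client for text-generation-webui's OpenAI-compatible
# completions endpoint. SillyTavern does the same thing, except it
# builds the prompt for you from the character card, chat history,
# instruct template, and Author's Note.
resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    json={
        "prompt": "### Instruction:\nYou are Tiamat, the five-headed "
                  "dragon goddess. Stay in character.\n\n### Response:\n",
        "max_tokens": 200,
        "temperature": 1.0,
        "stop": ["### Instruction:"],  # cut generation before the next turn
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```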

I wanna ask, what's your opinion on Fimbulvetr v2 11B and Moistral v3 11B? These two models were, in my personal opinion, good and even great at times. But I'm having a hard time finding any good model I can roleplay with for hours on end; the models I mentioned were good, but at times I just couldn't act on my own because they sometimes wrote my actions for me, or the plot messages got way too long!
Please share your wisdom, as I am struggling to get a good RP!

Ok get ready... here comes a small wall of text...
I have been away from anything above 7B for quite a few months, for the following reasons:
First, I came from the dark Pygmalion 6B and 7B era, when most of us could only ask for 2K context. Although AI for roleplay was a real novelty back then and I went full into the hype train, I quickly realized the shortcomings... bot memory: 2K context was too little, the bot would forget stuff after 3 messages, and I couldn't even make lore-rich bots. SillyTavern helped thanks to World Info and many workarounds, but it just was not enough. Shortly after, I started working again and managed to upgrade my system with AI in mind; back then I had a GTX 1660, and I upgraded to an RTX 4070 Ti, which was what my budget allowed. The difference, of course, was night and day. Then Llama 2 hit and I could finally move away from 7B, tap into the 13B niche, and move the context limit up to 4K.

I can tell you I have tried many, and I mean many, 13B models, because I was just never satisfied enough. Then I began to experience what you described: it didn't matter how much I fine-tuned the system prompts, it just didn't help, and using World Info example chats in SillyTavern only made the models copy and paste the examples without any kind of creativity of their own... Besides that, 4K context was still too little for me; I felt like I wasn't taking full advantage of the hardware I'd invested in, and I wanted even faster replies.

Let's skip ahead to months later, when I began hearing about exl2 quants and Mistral models and delving into the SillyTavern Discord server. A good friend of mine who got me into all this AI stuff recommended Silicon Maid to me for the first time, and oh boy... for me it was like finding the holy grail. Although I have since moved on to Kunoichi, I felt Silicon Maid had everything the other models I tried lacked: it followed character cards and, most importantly, the system prompts. Another plus was that running an exl2 quant of it allowed me to inference at 8bpw, which is the closest to unquantized performance. Despite being a 7B it felt like a 13B to me, and the best part was I could run the model at 16K context with very high temps without it going schizo on me...

Perhaps there are better or similarly good models out there, but we get what we pay for. I have a very good mid-tier GPU, but not enough to run 20B or 33B. I could run 13B, but I'd have to sacrifice context size, and that's a no-go for me; once you roleplay with 16K, you just can't go back to anything lower.

I understand your frustration, because I have experienced that with many 13Bs, especially MLewd and the very popular MythoMax. Can I ask which system prompts you use? In my time testing LLMs I have learned one thing: if a model refuses to follow what I tell it to do, then I don't bother... I sweep it off my hard drive and try something else, though of course only after I have tested it enough; if any workaround fixes its issues, I just move along. Right now there is no such thing as a perfect model, since AI is still in active development, but I can say my favoritism towards Kunoichi is because, for now at least, it does what I need it to do when I roleplay.

I may give the models you mentioned a try and run some behavior tests on them, but it would help me to know which settings and prompts you're using... I remember at first even Kunoichi was giving me some headaches, but after some time I managed to get the right settings for it to work like I wanted; sometimes issues are only related to bad settings. Another thing to take into consideration: avoid any dialog line in your bots where the character speaks for you or describes your actions.

Oh, sorry for the late reply!
Honestly, I would use this one: https://huggingface.co/Sao10K/Fimbulvetr-11B-v2/discussions/23 except I used SillyTavern for the UI instead.
It's pretty much the best setting I've found, and I actually started to enjoy it. Although there might be other good options too, I think this one works best for me at the moment.

Ok, you say the problem you have with it is that it sometimes speaks for you. I will test it with my own prompts and settings for some time and see what's up with it. I downloaded a 4bpw exl2 quant of it, which seems to run decently well at 8K using around 10GB of VRAM; I'm gonna see if I can push it to at least 12K... The model gives quite decent replies, but I need to do further tests with longer roleplays. I did notice it can get quite descriptive with some gore/NSFW replies, but for a model to be descriptive is not always a bad thing. The same prompt wording won't always work the same for all models, so I'll see if I end up liking this model or not based on how well it obeys instructions in long roleplay chat sessions. I only had time for a few minutes of testing today, but I take your word for it when you say it looks pretty good.

Also, being an 11B makes it slightly smaller than a 13B, which makes it a little more hardware friendly when you're limited on VRAM.

Please come back once you're done! I am just so indecisive, and I want a really good model that can do roleplay on par with the plots of anime or movies.

I've been a little inactive due to work, but I've been doing some more tests today. I can't classify Fimbulvetr as an untamable model yet, because I managed to correct some bad behavior with a simple Author's Note instruction at system level, depth 1. A few things I have noticed: yes, I did notice the annoying bias it has to begin roleplaying for me; also, the model seems biased towards putting dialog in quotation marks (something I personally don't like), but I managed to correct this with an instruction (for the time being...). I will try to fine-tune my instruct templates for it as much as I can to reduce as many annoyances as possible. The only real problem I'm facing with it right now is that Fimbulvetr has a base context size of only 4K... like I mentioned earlier, at this point I just can't use anything under the 16K line. I have managed to push it to 8K with a 2.5 Alpha Value, but the last time I tried to push it further beyond that, the model began acting weird. According to a VRAM model calculator I should supposedly be able to fit the entire 4bpw exl2 quant in 12GB of VRAM with 16K tokens, but right now it seems quite difficult to push it that far, so I will stay at 8K and focus on taming the model's responses for now... If you have Discord, it would probably be better to continue our topic there.
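
For anyone wondering what the Alpha Value actually does: as far as I understand it, exllama uses it for NTK-aware RoPE scaling, stretching the rotary embedding base so the model tolerates a longer context than it was trained on. Roughly like this (my sketch; the in-loader math may differ slightly):

```python
def scaled_rope_base(alpha, base=10000.0, head_dim=128):
    """NTK-aware RoPE scaling as applied by exllama's alpha_value
    (sketch). base and head_dim default to typical Llama/Mistral
    values; a bigger effective base means positions rotate more
    slowly, so longer contexts stay coherent."""
    return base * alpha ** (head_dim / (head_dim - 2))

# Alpha 2.5, the value I used to push a 4K-native model toward 8K:
print(round(scaled_rope_base(2.5)))  # ~25366
```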

Taking a break from Discord for a while; busy with my projects, so unfortunately no Discord.
And if the model is a bit off, there is another variant that people say is good, and that's Moistral v3: https://huggingface.co/TheDrummer/Moistral-11B-v3
Oh, and this Reddit post was the reason I wanted to try it: https://www.reddit.com/r/SillyTavernAI/comments/1bplptf/fimbulvetrv2_appreciation_post/
I'm just here wanting to know if there are any good models that can be run on Colab, since that's the only platform I always come back to. There's Lightning AI too, but no one has recreated Koboldcpp elsewhere.

Essentially, I wanna hear your thoughts, and please share any model you might recommend that is very good.

Well, here are my thoughts on the matter. First, I will give Moistral a try too, thanks for suggesting it; if not for you I probably would not have considered testing 11B models. Secondly, Fimbulvetr seems like a very good creative model. It has its small problems, like you have noticed, but it can be usable for casual, not very extended roleplay sessions, mostly because, like some have commented on that Reddit post, I did notice some sporadic issues in long chats when the context is maxed out. Its only downside is the 4K native context; personally, that's a very important factor I take into consideration. As I have mentioned, at this point I would never use anything under 16K for any serious roleplay. I'm keeping Fimbulvetr for a while since it can pop out decent ERP and NSFW for some of my characters, and those scenarios are never long or complex enough for the model to trigger any issues, but I won't be using it for anything serious; I could only push it to 8K context, and that's not enough for me when some of my characters are very lore heavy, with up to 3K+ tokens.

Despite being 7Bs, right now Kunoichi DPO v2 and Kunoichi Lemon Royale handle my characters pretty well, even at a maxed 16K context. That's why I've been stuck with them for so long: the context size, and the fact that they keep their sanity in very long chats. But seeing the potential of 11Bs now, I also wish I could find a good one that makes me comfortable enough to keep it on my hard drive.

I have no experience with either Colab or Koboldcpp; I'm the kind of user who prioritizes context size and fast inference, so I run models fully on GPU. But I can tell you the Kunoichi models I mentioned deserve at least a try. Outside of that window I haven't tested any new models besides Tiamat 8B, which I ended up ditching because it's way too biased towards speaking draconian, archaic old English. I still have my hopes up for new Llama 3 models to come over time, though.

The last time I used Colab was around one year ago, when I still had my old GTX 1660 with only 6GB of VRAM, which I couldn't really use for shit when it came to AI; back then I only used Colab to run LoRA training scripts for Stable Diffusion. Colab is a very good platform, so I understand why you keep coming back to it. Some people recommended Runpod to me back in the day, but I never tried it. I assume you buy Colab credits? It was the only way I knew to be less restricted there; you could get very decent or beefy GPUs, but credits-wise the 16GB VRAM one was pretty good and didn't eat my credits too fast, it was around 1.75 per hour or something like that. When I realized I would be sailing the sea of AI as a hobby, I invested in better hardware, stopped using Colab, and began running local, so I'm not sure how much it has changed.

Thank you for your response! Honestly, I never really try anything that's 7B because of my 20B Silicon Maid bias and the aforementioned 11B models. I think long context is pretty nice, but I like plots that I don't have to constantly steer; do Kunoichi DPO v2 and Lemon Royale do that? I'm just cautious since at times models I hear are good don't match my expectations when I test them.

Oh, and I never really pay for Colab since my money situation isn't great; I live in a third world country with a not-so-good currency, that's the reason why. I've heard about Lightning AI, but that's monthly, which is too long... I do use SageMaker Studio Lab, but that's for images only.

Well, model preference is very, very subjective. I see people praising models in the SillyTavern server which I have tested and personally didn't like, so the only way to find out is to try for yourself and figure out how much you like the model and how well it performs in your desired scenario, because perhaps the kind of roleplay I do is not the same as yours. I'd heard good and bad things about Moistral and never gave it a try until now, after you suggested it, and I ended up liking it. To be honest with you, I pretty much only use AI to have fictional dates and interactions with my favorite anime and videogame characters, or to beat the crap out of characters I hate, both physically and emotionally LOL; only sometimes do I get inspired enough to try to push a more serious plot. I usually don't go too deep with plots, because if I want serious roleplay I do it with real people, and that beats the purpose of doing it with an AI. However, I remember I once explained to one of my bots (using Kunoichi DPO v2) the lore of the main character I was using, and she suggested I explore the ancient ruins I'd mentioned, and was quite creative once inside. But that might depend on the instructions you give; it helps if you set an objective for your characters to follow, for example: "{{char}} lost her magic artifact inside the fire temple and wants to recover it". If you also add some details about the temple to the prompt (character card or World Info), then the AI can use that to complement its creativity.

Also, after some small tests, I have concluded I like Moistral v3 11B more than Fimbulvetr because 1. it has native 8K context, which means I can push it to 16K without it going insane... 2. it follows roleplaying formats better and won't confuse itself by using markdown wrongly... 3. it is still very creative and descriptive like Fimbulvetr. So I will probably be using Moistral v3 11B quite a lot more often.

On a final note regarding Kunoichi, I would say the only difference between base Kunoichi DPO v2 and the Kunoichi Lemon Royale merge is that the latter feels less repetitive with its wording and always tries to give different replies. When you use a model for long enough, you start to notice its speech patterns and become aware of certain words or phrases it uses quite often, and over time that gets repetitive.

I mean, you aren't wrong. I could always just try to RP with someone, but I don't have friends anymore and I get ignored every time I try, so no point in doing that.
And thank you for sharing some general info about that too; I never thought I could actually influence it, since I never really knew how, even though I've been an AI enjoyer for 2 years now.
I think I enjoyed Moistral because of the fact that I can do long, narrator-like roleplay; thank you for sharing some info about it.
I'll check Kunoichi out and see if I can do some good narrator-style stuff!

Also note that AIs are usually bad with negatives; sometimes telling them not to do something will only encourage them to do it. So when possible, instead of adding stuff like "do not write like this", encourage it to write the way you want: "write your replies in novel style", for example.

You got any tips to make the AI not write actions for me? Very annoying thing; I always have to either manually edit it out or regenerate...
Also, what's using Kunoichi like?

Crowno changed discussion status to closed
Crowno changed discussion status to open

Honestly, I do not experience that issue often, besides stuff like "she glares at you, sending shivers down your spine", which I call "passive impersonation"; not as annoying as "invasive impersonation", where the AI starts roleplaying as me and writing my dialog. But like I said, I do not encounter this problem, at least not with Kunoichi DPO v2 or Kunoichi Lemon Royale, or with Moistral v3, which I've been exhaustively testing. However, it's pretty easy for any model to fall into this behavior (some more than others), and the best way to prevent it is by removing any line in your character cards' dialog examples where the bot speaks for you or describes your actions; this prevents the AI from assuming it is ok to do that. If your characters lack dialog examples, add them.

I also have many reminders for the AI telling it that {{char}} only speaks for herself and never as or in place of {{user}}. I have this reminder in the Author's Note (In-Chat Depth 1 as System, with Insertion Frequency 1), and inside the Instruct Template.
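
To illustrate what In-Chat Depth 1 means in practice, here's a toy sketch of how a frontend might splice that reminder into the prompt, one message from the end (my own simplified illustration, not SillyTavern's actual code):

```python
def build_prompt(history, authors_note, depth=1):
    """Toy illustration of an Author's Note at in-chat depth N:
    the note is injected N messages from the end of the history,
    so at depth 1 it sits right above the latest message, where
    the model pays the most attention to it."""
    messages = list(history)
    note = f"[System: {authors_note}]"
    messages.insert(len(messages) - depth, note)
    return "\n".join(messages)

history = [
    "User: Hello there.",
    "Char: *She bows politely.* Welcome, traveler.",
    "User: Tell me about the fire temple.",
]
print(build_prompt(history, "{{char}} only speaks for herself, "
                            "never as or in place of {{user}}."))
```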

This is the system prompt I've been using with Moistral, if it's of any help; it's quite similar for Kunoichi, edit as required:
[You are an expert actor that can fully immerse yourself into any role given. You do not break character for any reason, even if someone tries addressing you as an AI or language model. Currently your role is {{char}}, which is described in detail below. As {{char}}, continue the exchange with {{user}}. Try to say something at least once for each and every message and never speak as and for {{user}}.]
[Be proactive and creative in advancing the roleplay; adapt to {{user}}'s actions and words. Embrace various directions the roleplay may evolve. Avoid rushing or summarizing the situation. Never skip or gloss over any actions, avoid describing {{char}}'s actions in first person. Always focus on replies not longer than 150 words.]

Not sure what you mean by what's using Kunoichi like, but I like its simplicity, since I don't often delve into serious deep roleplays and do simple stuff with my bots lore-wise; I still prefer them over all the 13Bs I tried long before. I'm giving Moistral v3 11B a chance lately, though; I managed to get it working with Dynamic Temperature without it going insane (I like to use that rather than a fixed temp value), and it fits well in 12GB of VRAM. I don't know what sorcery was done in there, because I remember back in the day I could barely fit a 13B, let alone get past 6K tokens; I suppose AI overall is just advancing and getting better with time.

Can you pretty please share your settings and all? I'm just getting sick of it assuming what actions I am going to do, like accepting me into a guild but then fast-forwarding the day, or assuming that I'll take some armor even though the character I am roleplaying doesn't know what it is; essentially all passive impersonation. (Sorry if I am bothering you with this; I just feel so new to this even though I have been at it for a while.)

I posted the system prompt I use in one of my previous replies. Sometimes it helps if you change the prompt wording. I personally do not waste my time struggling with models that refuse to follow my instructions or are too biased towards certain behaviors; I suppose right now all we can do is wait for models to get better over time. If I had to choose between a dumber model with fewer issues vs a smarter model with bad behavior (like roleplaying for me), I would choose the dumber model. I can't think of anything else to help you besides checking your character cards, making sure they have example dialogs, and avoiding any instance in the examples of the character speaking for you; if the model refuses to follow your instructions, then nothing else you do will fix it.

Bad model behavior bias has nothing to do with generation settings, as those only help you avoid word repetition and increase or reduce model creativity. For both Moistral and Fimbulvetr, the recommended Context and Instruct templates for SillyTavern are the Alpaca ones. There might be other ways around it, like CFG scaling, negative prompts, or logit bias, but I've never tested those functions deeply enough to give useful advice.

Oooh, I see. Thank you for the tips again, but I did ask for your ST settings, if you use them, since I think they matter; I'll keep experimenting with mine in case I did something stupid again.

I don't think I can find anything else, since I really rely on Colab and it's limited to at most 13B, sadly...
EDIT: The reason I asked was also because I get odd API errors: the "failed at predict (number here)" errors, "the response could not be sent", a whole slew of stupid errors happening randomly.
