The generation stops abruptly sometimes

#2
by TheYuriLover - opened

Hello,

I noticed your model has some abrupt stops during generation, like the sentence stopping right away without even finishing. I saw a similar issue with the older vicuna-free model, so I guess this training method still has this bug.

Cognitive Computations org

I get that too, rather often.
I wonder if there is some kind of contamination in the dataset and it's coming from the model, or if it's some bug in our inference engines.

Cognitive Computations org

Must be a FastChat thing. I'm on axolotl now, and I'm not seeing that kind of problem.

Cognitive Computations org

That would suggest an issue with the tools used for inference, rather than with the model.
I'll test more with different loaders, engines and quantizations.

Cognitive Computations org

I couldn't reproduce it yesterday, on a fresh install of my oobabooga (I reinstall it every few days).
So either something got fixed, or it was due to something stupid on my part, like forgetting to change settings such as the max token limit after installing it.

I get the same problem - the responses are truncated at some point regardless of how much of the context is used. Banning the EOS token solves the problem, but then the model just goes on forever. I think I am using the correct format for all the prompts, but there seems to be some problem with it, specifically the EOS token being added prematurely. I don't know how to solve it though - maybe some of you guys have an idea?
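For reference, this is roughly what "banning the EOS token" looks like with plain transformers - a minimal sketch, where the model path and the Vicuna-style prompt format are just assumptions, not necessarily what this repo expects:

```python
# Minimal sketch of banning the EOS token with plain transformers.
# The model path and prompt format below are placeholders/assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/wizardlm-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "USER: Describe the tavern we just entered.\nASSISTANT:"  # assumed Vicuna-style format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Forbidding the EOS token removes the premature stop, but generation
# now only ends when max_new_tokens is reached - hence "goes on forever".
output = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    bad_words_ids=[[tokenizer.eos_token_id]],
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```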

Cognitive Computations org

you should use a newer model. LLaMA2-13B-Tiefighter maybe, or OpenHermes-2-Mistral-7B

Yeah, probably, but I want to use it for roleplay chat, and so far your WizardLM model really hits the tone best. The other ones I tried either talk like children or come out with weird objections like "I can't harm innocents" and "We can't risk losing our humanity" for a character with extremely psychopathic tendencies. That kinda ruins the illusion^^

I think I kinda found a way around it by modifying the prompt and banning the EOS token. Now it mostly works, but sometimes the model still overshoots with statements like "End of Roleplay". Do you know if it's possible to introduce a special token that reliably gets put at the end of the reply? Or custom stopping strings like the user's character name?
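On the custom stopping strings: with plain transformers you can roll them yourself via a StoppingCriteria - a rough sketch below, where the stop strings ("End of Roleplay" and the user's character name "Alex:") are made-up examples. Text-generation-webui also has a custom stopping strings setting that should do the same, if I remember right.

```python
# Rough sketch: stop generation when any of the given strings shows up in
# the newly generated text. The stop strings here are made-up examples.
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    def __init__(self, stop_strings, tokenizer, prompt_len):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_len:])
        return any(s in new_text for s in self.stop_strings)

# Usage, with tokenizer/model/inputs as in the earlier sketch:
stopping = StoppingCriteriaList([
    StopOnStrings(["End of Roleplay", "Alex:"], tokenizer, inputs["input_ids"].shape[1])
])
output = model.generate(
    **inputs,
    max_new_tokens=300,
    bad_words_ids=[[tokenizer.eos_token_id]],  # still banning EOS
    stopping_criteria=stopping,
)
```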

Cognitive Computations org

While I haven't done enough testing to put my finger on anything confidently, I've seen that issue surface more often with models quantized to 4 bits than at 16 or even 8 bits.
Almost as if during quantization there's some non-zero chance for models to forget EOS tokens, or for those to lose enough weight to cause issues.
On the other hand, I've not run into similar issues with models fine-tuned using the ChatML format yet. At least the few I did test a bit have all been following proper "grammar" without losing the plot, even when quantized.
So I'm hoping that's going to be less of an issue in the future, with more groups transitioning to stricter formats in their future fine-tunes.

Could be - I only briefly tested it by loading a different model in the same chat. The other ones seemed to be able to continue without problems, and they were all 4-bit quantized models. That doesn't mean much though; their specific error might surface under different circumstances. Other than that, I noticed that it doesn't seem to make a difference whether you choose the GPTQ or GGUF format - the same error occurs.

I tried something with modifying the prompt and it seems to be working - the EOS token is banned and I instructed the model to "add end_of_thought_N after each paragraph with N counting upwards". Combined with a custom stopping string on "End_of_thought_2", it at least works as intended. Granted, it's a dirty workaround, but it seems to do the job for the time being.
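In case it's useful, a small generic cleanup step can cut the reply at the marker and strip the leftover counters before showing it - purely illustrative, names follow the workaround above:

```python
# Generic post-processing for the end_of_thought_N workaround: cut the
# reply at the chosen stop marker and strip any leftover markers.
import re

def clean_reply(text, stop_marker="End_of_thought_2"):
    idx = text.lower().find(stop_marker.lower())
    if idx != -1:
        text = text[:idx]                                 # drop everything after the marker
    text = re.sub(r"end_of_thought_\d+", "", text, flags=re.IGNORECASE)
    return re.sub(r"[ \t]{2,}", " ", text).strip()        # tidy up leftover double spaces

print(clean_reply("She nods. end_of_thought_1 The rain keeps falling. End_of_thought_2 leftover"))
# -> "She nods. The rain keeps falling."
```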

Cognitive Computations org

I imagine models that suffer from that can be forced to work with some additional prompt crafting.
Alternatively, toning down quantization to 5-6 bits, rather than 4 and below, might be another workaround. Both GPTQ and GGUF formats are efficient, but a bit brute-force in the way models get quantized.
This week I'm away from home and from my machines, so I can't test until at least the middle of next month, but I'll do more testing when I'm back home.

That sounds good - after your suggestion I tried the 8-bit version. I got an error upon loading it, though, so I can't say yet whether it makes a difference. I did notice that the model seems to perform very well for the initial part of the conversation and only goes off the rails later. I think at a certain point the context exceeds what the model is able to meaningfully interpret.

I was thinking of diving a bit more into AutoGen and memGPT to basically delegate the task of building a suitable prompt to different agents. Different agents could then factor in different aspects of the character, like abilities, past memories and the evolving relationship to the user's character, in addition to the chat history. In combination, the agents would then craft a specific prompt for each interaction, which I hope would make things a bit tidier and less taxing for the model to handle.
If I find something that works, I'll be sure to post it here - it would be a shame if this model were passed by as development moves on, because personally I really like it.
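If it helps anyone thinking along the same lines, here's a plain-Python sketch of that idea - not actual AutoGen or memGPT code, and every name in it is made up: each "agent" is just a function contributing one aspect of the character, and the pieces get assembled into one prompt per turn.

```python
# Plain-Python sketch of agent-style prompt assembly; all names are
# illustrative, not an actual AutoGen/memGPT integration.

def abilities_agent(character):
    return "Abilities: " + ", ".join(character["abilities"])

def memory_agent(character, max_items=3):
    # Stand-in for real retrieval: keep only the most recent "memories".
    return "Relevant memories:\n" + "\n".join(character["memories"][-max_items:])

def relationship_agent(character, user_name):
    return f"Relationship to {user_name}: {character['relationship']}"

def build_prompt(character, user_name, history, user_message):
    parts = [
        f"You are {character['name']}. {character['persona']}",
        abilities_agent(character),
        memory_agent(character),
        relationship_agent(character, user_name),
        "Recent chat:\n" + "\n".join(history[-6:]),  # trim old turns to save context
        f"{user_name}: {user_message}",
        f"{character['name']}:",
    ]
    return "\n\n".join(parts)
```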

Have fun on your trip! Safe travels!
