Hi!
I found your model and have some unsolicited tips. Mergekit acts funny when it comes to layer_range: you would expect [0, 23] to cover layers 0 through 23 with 0-indexing, but that is not how mergekit rolls; the range is half-open, so the end index is excluded. I have also tried some different configurations of stretching danube2, and without further training the following YAML only loses a couple of percentage points.
```yaml
slices:
  - sources:
      - model: h2oai/h2o-danube2-1.8b-chat
        layer_range: [0, 23]
  - sources:
      - model: h2oai/h2o-danube2-1.8b-chat
        layer_range: [13, 24]
merge_method: passthrough
dtype: bfloat16
```
This results in 34 layers and a chat model, whereas I see you use the base version. Sticking the new layers in the middle tends to lower benchmark scores like WinoGrande and HellaSwag by about 5 percentage points, sometimes a bit more, before any further fine-tuning is done.
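In Python terms the half-open ranges work out like this (just an illustration of the indexing, not part of the config):

```python
# mergekit's layer_range behaves like a Python slice: [start, end) keeps
# start..end-1, so [0, 23] keeps layers 0-22 and [13, 24] keeps layers 13-23.
first_slice = list(range(0, 23))    # 23 layers
second_slice = list(range(13, 24))  # 11 layers
print(len(first_slice) + len(second_slice))  # 34 layers total in the passthrough merge
```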
You've got some very interesting projects going. Well done :)
Thanks for the insights from experience! I have definitely found that mergekit layer slicing behaves ... less than intuitively; I've taken to working more directly in transformers for some of my other efforts to stretch models with themselves. :)
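Roughly the idea, as a minimal sketch rather than the exact script, assuming a Mistral-style layout where the decoder blocks live in model.model.layers:

```python
# Sketch: duplicate a slice of decoder layers to depth-upscale a model directly
# in transformers. Assumes a Mistral-style architecture (danube2 qualifies).
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2o-danube2-1.8b-chat", torch_dtype=torch.bfloat16
)

layers = model.model.layers                          # ModuleList of decoder blocks
stretched = list(layers[:23]) + [copy.deepcopy(l) for l in layers[13:24]]
model.model.layers = torch.nn.ModuleList(stretched)  # 23 + 11 = 34 layers

# Keep the config in sync so save/reload works, and re-number layer_idx where
# newer transformers versions track it for the KV cache.
model.config.num_hidden_layers = len(model.model.layers)
for i, layer in enumerate(model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i

model.save_pretrained("danube2-stretched-34L")  # output path is just an example
```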
... Oh, you made NinjaMouse! That makes sense!
Yeah, I'm sorry for not being pretentious enough ^^ I was actually looking for other models to merge into v0.2, but it seems the stock model is either capable enough without additional training, or the good tunes are hidden behind silly names like NinjaMouse.
Danube2 seems to do well on training and eval loss when it comes to science, programming, and general assistance, but I couldn't get RP and storytelling to behave in a predictable way, or at least in an intended way. Did you overcome that somehow? I ask because I have a nagging feeling that 7B parameters might be the minimum for the amount of knowledge required to make that kind of creative writing work. I hope I am wrong though.
For 1.7, I did try mixing in Severian/Internal-Knowledge-Map-StoryWriter-RolePlaying to seed future understanding. For evaluations, EQ-Bench seems like a quick one that tracks with roleplay ability more than other benchmarks do; I have had some interestingly high numbers on EQ-Bench, compared to other assessments, when doing merges involving specifically RP models. By that standard 1.7 is better than base by a notable degree: 15 points vs 0.05 from the base model, according to my gists.
I think my highest EQ-Bench for a currently private effort on a danube upscale is 23.9 at 2.7B parameters... That one I instruction-tuned myself on some RP-focused datasets, though I'm not really using it at the moment (at least 5% of its responses still fail to parse on the test, for one thing). I have concerns about whether Phi's or Cosmo's base dataset includes enough useful material for fine-tuning to work out right, though Phi might pull things off alright through in-context learning.
Have you tried much adding instruction tuning to a base model? Or a near-base model: in Qwen2's case it looks like there is some light instruction training already, but probably not the full SFT+DPO treatment that the chat and instruct versions get. I expect that even with a well-distributed initial dataset, a lot gets 'locked in' and forgotten by an instruction training regime that intensive.
IKM is an interesting approach. I'm kinda surprised that language models have an easy time with markdown; I would have guessed that JSON or XML would be more suitable for structuring that kind of data. And I am just full of wrong guesses; I had not tried instruction tuning a base model. Or rather, up until 2 days ago I hadn't, and gods damnit, you are most likely right. It's literally in every README, come to think of it.
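For reference, the rough shape of that kind of run, assuming TRL's SFTTrainer; the dataset and output names below are placeholders rather than what I actually used:

```python
# Minimal SFT sketch on a base (non-chat) model, assuming TRL's SFTTrainer.
# Dataset choice and hyperparameters are illustrative, not a tested recipe.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")  # plain "text" column

config = SFTConfig(
    output_dir="danube2-base-sft",   # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model="h2oai/h2o-danube2-1.8b-base",  # the base model, not the chat tune
    args=config,
    train_dataset=dataset,
)
trainer.train()
```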
I struggle with Qwen2 because I keep running out of VRAM, even with the 1.5B, but I am certain that it is a me problem. I did just run EQ-Bench on the abliterated version, with the following results from lm-eval: a score of 31.4 with 88.9% of responses parseable. Which is wild compared to mine and the chat-tuned danube, which are in the negatives.
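For anyone wanting to reproduce that, roughly how the run looks through lm-eval's Python API, assuming the harness' eq_bench task; the model id here is just an example:

```python
# Sketch of scoring EQ-Bench through lm-evaluation-harness' Python API.
# Swap in whichever model you actually want to test.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-1.5B-Instruct,dtype=bfloat16",
    tasks=["eq_bench"],
    batch_size=1,
)
print(results["results"]["eq_bench"])  # score plus the percent of parseable responses
```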
It really seems like fine-tuning stifles creativity and even emotional understanding when done in excess. Thank you for introducing me to EQ-Bench and making me face my ignorance about base models. Lastly, I would like to ask you for permission to merge 1.7 into the last version of NinjaMouse. My initial testing yielded good results when doing a model stock merge, but being kind and respectful is more important than making numbers go brrrrrrrrrrr.
Looks like abliteration hurt its parsing: the stock Qwen2-1.5B-Instruct got a 30 with 97% parseable when I tested it. Solid model, but ...
... I don't think it's a you problem; Qwen2 has a massive vocabulary compared to the Mistral-based models, and like Danube it has a lot of attention heads. My rough understanding is that, due to the embedding layer size and the way the number of attention heads increases training time, some of these small models take a lot more compute to train than others. (Gemma is a worse offender on the vocab front.) I've been working on some ideas for base mini-foundation models that are quicker for hobbyists to train and can be worked on with 24 GB VRAM cards; most recently I've been focused on Mistral v0.3 as a 'base' config architecture to build on. A 32768-token vocabulary seems like a great balance (fully exploiting that power of 2), and it's intuitive enough to replace the control tokens as you need to train new ones. Additionally, a lot of libraries have good support for the Mistral and Mixtral architectures (though it's still tricky to figure out how to train those MoEs).
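Something in that direction looks roughly like this, assuming transformers' MistralConfig; the sizes below are illustrative stand-ins rather than the actual settings:

```python
# Sketch of a small Mistral-v0.3-style "base" config. All sizes are placeholders;
# the point is the 32768 vocab and the standard Mistral layout.
from transformers import MistralConfig, MistralForCausalLM

config = MistralConfig(
    vocab_size=32768,           # power-of-two vocab, easy to repurpose control tokens
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,      # GQA keeps the KV cache and memory use modest
    max_position_embeddings=8192,
)
model = MistralForCausalLM(config)
print(f"{model.num_parameters() / 1e9:.2f}B parameters")
```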
(This is a bit of a slog ... even if it's working, you look at the 100 hours it took to train on 1.5B tokens, then look back at the 1T tokens Danube was trained on ...)
Feel free to merge any public model I trained that has a license suited to you, including 1.7.