I have a question

#6
by BigBeavis - opened

Hello there! Would you mind sharing details about the merge? I want to attempt a similar experiment with different models; it would probably help a lot to know which layers you chose for the merge and why, and also how it differs from the Madness version.

Hey;

Check out the merge kit here:

https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF
(sub in the "model maker's name" in place of the FILENAME to use Mergekit online).

This has a base of 32 layers - all the models are 32 layers.
To use this for Mistral Nemo models, SCALE it up to 40 layers (divide each layer number by 32, then multiply by 40 - do this for every layer number in the merge).
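For example, here is the arithmetic for one made-up slice (placeholder model name and layer range, not taken from the actual formula):

```yaml
# 32-layer source slice:  layer_range: [14, 20]
# Scaled to 40 layers:    14/32*40 = 17.5 (round - 17 or 18, test both) ; 20/32*40 = 25
slices:
  - sources:
      - model: someone/mistral-nemo-finetune   # placeholder
        layer_range: [18, 25]                  # rescaled from [14, 20]
```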

You can then use this for merging Mistral Nemo models of 40 layers.
I suggest you then experiment SLOWLY with changes - one LAYER at a time.

All layers of the model interact with all other layers - so great care must be taken.
Follow the setup - two blocks of 3 models each, AND the ORDER of the models.
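If it helps to see the shape, here is a bare skeleton of that layout - model names and layer numbers below are placeholders, not the actual Grand Horror values (read the real formula at the link above):

```yaml
merge_method: passthrough
dtype: bfloat16
slices:
  # Block 1 - the three models, in this exact order, lower/middle layers
  - sources:
      - model: maker/model-A        # placeholder
        layer_range: [0, 14]
  - sources:
      - model: maker/model-B        # placeholder
        layer_range: [8, 20]
  - sources:
      - model: maker/model-C        # placeholder
        layer_range: [12, 22]
  # Block 2 - the same three models, same order, upper layers
  - sources:
      - model: maker/model-A
        layer_range: [18, 26]
  - sources:
      - model: maker/model-B
        layer_range: [22, 30]
  - sources:
      - model: maker/model-C
        layer_range: [26, 32]       # only this last slice reaches the final layer
```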

I use the Grand Horror 16B "template" (as well as "scales" of it) in some form or another for a lot of my models.

RE: Madness / Darkness are compressions of 23.5B (down to 40 layers).
Compressions are complex and rife with "trial and error" + testing.

The trade-off: lose some creativity/"character", gain some stability... and these can be merged with other Mistral Nemo models (40-layer default).

Thank you for the reply! I think I see the basic logic behind the way you sliced them. In hindsight, doing it in 6 steps for 3 models instead of 3 steps seems so obvious, and yet I didn't even think of that.
I made a mistake: by Madness I didn't mean the compressed versions, but the first/Untamed version - DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-23.5B-GGUF. I was curious about the difference in the slicing formula.

I'm checking out the different versions of Grand Horror you linked. In some versions you apply a scaling parameter to the layers, and you go very step-by-step. For example, the Stheno layers run 0-14 at scale 1, then 14-20 at scale 0.8, right after which you add 20-25 at the same scale (why? why not just write it as 14-25 at 0.8?), then 25-27 at 0.6, and finally 27-28 at 0.9. At that point I'm extremely curious how you even found out what layers 25-28 do in order to prescribe them such specific values.

RE: 23.5B / 23B:
In the 23B I trimmed a few layers off and adjusted positions.

RE: "Scale".
Big topic.

First -> I change TWO parameters directly, and the third is a default for the rest of the "weights/gates" (there are 10+ in total).

Changes to these are based on testing, as well as an understanding that certain values (and ranges) will work, whereas others will introduce instruction-following and/or output-generation issues depending on the location/layers in the model (both within the merge and at the direct location - i.e. BASE: 32-layer model, 40-layer model).
I don't use all 10 because that is a recipe for madness; these three are very powerful on their own and keep me sane when it comes to testing/adjustments.
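Mechanically, in mergekit it looks roughly like this - the tensor names and values below are placeholders for illustration, not my actual settings (pull those from the formulas themselves):

```yaml
slices:
  - sources:
      - model: maker/model-B                # placeholder
        layer_range: [25, 28]
        parameters:
          scale:
            - filter: o_proj                # first weight adjusted directly (placeholder choice)
              value: 0.6
            - filter: down_proj             # second weight adjusted directly (placeholder choice)
              value: 0.6
            - value: 1.0                    # the default, applied to every other weight/gate
```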

A lot of this knowledge comes from reading papers on merge theory, layer theory, general model operation theory and a whole lot of trial and error.
The last part is an understatement.

RE: 25/27 -> at approximately the 3/4 layer area of the model (scaled: 32 or 40 layers for Llama 8B / Mistral Nemo 12B respectively) there are critical control layers that cross-connect to other layers.
That is really too short an answer...; but it is the raw gist of why these specific layers matter.

Likewise "scale 1" is "normal" scale or size.

Thanks a lot! I guess I'm in for a ride.
As for the trial-and-error part, do you have any specific testing routines, or do you just go by feel?

RE: Testing;

See both "Basic Parameters" section AND the very LAST section on this page [see index]:

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters

You basically want to use some "test prompts", with the same quant, the same settings, and temp=0;
You get a baseline first, then modify the "stack" -> retest.
Then test "at temp" (this is explained in the docs above).

Keep in mind changes can be much larger, relatively speaking, for the PASS-THROUGH method compared to other merge methods like DARE, Breadcrumbs, Task Arithmetic, etc.
Likewise, different "archs" will behave differently when using this mergekit method.

Thank you so much, your advice nudged me in the right direction quickly. I looked up some info on the general distribution of roles among layers in LLMs, and I think I got the gist of where to start experimenting. I'm still a bit lost on the control layers, though - as in, how do I even find them, and do I even need to worry so much about them for now when not using scale filters?

Hi, David, it's me again! I've been figuring stuff out on my own over the past couple of days.
I've tried asking GPT-4o for some advice; although the data it provides is contradictory between regenerations, some things remain constant, so it was somewhat helpful for getting the gist of what layer ranges are generally responsible for what.

The gist I got is that the early layers (~25%) are responsible for general token embedding, grammar, token relations, etc. - the language system itself for the model to operate with, I guess.
Then the middle layers (25-75%) contain context for the model to fall back on for reference - basically the knowledge pool: what X, Y, and Z are, and how they relate to each other.
And the top layers contain the highest-concept data, which most directly influences the model's reply structure: formatting (like ChatML vs. Mistral Instruct), style, mood, verbosity, etc. Knowledge too, but I guess the most relevant kind for performing the tasks the model is specialized for.

I basically wanted to hear your thoughts, based on your experience and knowledge: is this representation accurate?
Because, assuming it is, I can surmise that it's possible to stitch together models which use completely different formatting - as long as the top layers are dominated by models with a single format - without running into the associated issues?
This also makes me think that for a passthrough merge it may be best not to focus much on the first 25% of layers (mostly leaving their count untouched compared to the base models) and instead pile on more layers from the middle and the top. I also thought that maybe it's best to use the actual base model for those lowest layers, since it wouldn't have any of the degradation associated with finetunes. Those being the foundational layers, which don't seem very influential for style and substance anyway, perhaps they could benefit from being as clean as possible?

GPT also helped me plan out the gist of a complex merge involving 6 models, but I guess it's one thing what GPT thinks is going to happen, another what clueless me thinks, and a third what reality will actually do. So I guess I'll be trying out various configurations over the following days. Laying out slices for 6 models is way tougher than for 3, even following your example, especially now that I'm going in blind.

A few notes - (only applicable to pass-through merges!):

1 - Generally I stay away from the first 25-50% of the layers - in the first block.
2 - Knowledge is embedded in every layer, and some layers cross-connect.
3 - The last 1/3 of the model (minus the last layer) is stronger and stronger levels of nuance.
4 - The final layer MUST NEVER be duplicated; you can only have ONE (i.e. layer 32, layer 40) => or the model will crash and burn - see the sketch below. (There is an exception, but it is very complex.)
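For point 4, the tail of the config should end up looking something like this (placeholder names, 40-layer example) - the key point is that exactly one slice reaches the final layer index:

```yaml
slices:
  # ... earlier slices ...
  - sources:
      - model: maker/model-B        # placeholder
        layer_range: [30, 39]       # stops short of the final layer
  - sources:
      - model: maker/model-C        # placeholder
        layer_range: [39, 40]       # the ONLY slice that includes the final layer
```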

RE: Formatting, style - Can't comment on this, never tested for this.

The problem with pass-through merges is also their power - you have 100% of the power of multiple models "competing" to change each token.
Sometimes 2 or 3 layers (of the same layer number - regardless of location in the model) are also competing.
And the more layers => the more change.

Likewise this can also create what I call a "cascade" (2-3 layers above AND other layers interacting) -> Interactions that would never happen normally.
=> Super Creative.
=> You now have a model greater than the sum of its parts.

Models you could merge via other methods rarely come close to pass-through merges' power levels.

In fact a lot of people are "dollar cost averaging" using other methods, which drives me crazy, because they are leaving so much power on the table.
I don't take this approach (ie Dare, Dare-Ties, Breadcrumb etc).

If you want to try out a full precision Dare Ties merge see this model:
https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF

This is an example of when you DO NOT "average out" a model.
(Any of the Dark Planet series - 8B sizes - same -> a focused merge, not "averaging".)
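For reference, a generic dare_ties recipe looks roughly like this - placeholder model names, NOT the actual formula for any of the models above, just the general shape of the method:

```yaml
merge_method: dare_ties
base_model: maker/base-model          # placeholder
models:
  - model: maker/finetune-A           # placeholder
    parameters:
      density: 0.5                    # fraction of delta weights kept
      weight: 0.5                     # contribution of this model
  - model: maker/finetune-B           # placeholder
    parameters:
      density: 0.5
      weight: 0.5
dtype: bfloat16
```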

However, the downside with pass-through merges is that these interactions affect model stability.
So... it is a fine needle to thread.

I would suggest using a few pass-through merge formulas and swapping out model positions first to see the differences - keeping the general flow the same.
You may also want to check out 2-model pass-through merges like "Psyonic-Cetacean 20B" / "Black Forest 20B" (this one is complex).
(Search for them on HF; the merge formulas are at the org repos.)

For a real challenge:
https://huggingface.co/DavidAU/TieFighter-Holodeck-Holomax-Mythomax-F1-V1-COMPOS-20B-gguf

This is a ("nightmare!") FOUR model, pass-through merge.
