On using a base model as a component model
Hey, so I've been experimenting with sparsifying merges, and I noticed that here, you included Ataraxy both as a base model and a component model. That's not something you have to do, but I'm wondering if you notice any improvement doing that vs. not? Or does it come out the same?
Hi. It certainly changes the behavior of the model, but I'm not sure yet in what way, or whether it can be called an improvement. From small tests, it feels like more of the base model is retained with this approach. I probably need to read more in depth about how DARE-TIES works and experiment more with it before I can say anything more useful. Though if you already have information on this, I'd be interested to hear it.
I made a version of the same merge but without the base model among the components, if you're interested in comparing for yourself.
> It certainly changes the behavior of the model, but I'm not sure yet in what way, or whether it can be called an improvement. From small tests, it feels like more of the base model is retained with this approach.
I see, thank you! I'll compare them myself. I was curious because my current project is also a DARE-TIES merge, and while that first version is excellent, it also suffers from some coherence issues at long context. I was wondering if adding the base model as a component might help.
> I probably need to read more in depth about how DARE-TIES works and experiment more with it before I can say anything more useful. Though if you already have information on this, I'd be interested to hear it.
I sure do!
First, it's helpful to know that all these methods are based on Task Arithmetic. Task Arithmetic, and by extension TIES/DARE/DELLA, all work on task vectors, which is what you get when you subtract a base model from a finetuned one. It captures the difference between them, which represents the "task" that the finetuned model is intended for, hence the name. It's the vector along which a base model must travel to get to the finetuned version, if that helps.
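If it helps to see that concretely, here's a minimal sketch of a task vector and a plain Task Arithmetic merge in PyTorch; the function names and state-dict setup are illustrative, not mergekit's actual API:

```python
import torch

def task_vector(base_sd: dict, finetuned_sd: dict) -> dict:
    # The task vector is simply (finetuned - base), per parameter.
    return {name: finetuned_sd[name] - base_sd[name] for name in base_sd}

def task_arithmetic_merge(base_sd: dict, task_vectors: list, weights: list) -> dict:
    # Plain Task Arithmetic: add a weighted sum of task vectors back onto the base.
    merged = {}
    for name in base_sd:
        delta = sum(w * tv[name] for w, tv in zip(weights, task_vectors))
        merged[name] = base_sd[name] + delta
    return merged
```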
What TIES, DARE and DELLA do, versus standard Task Arithmetic, is sparsify the weights of the component models (that is to say, turn their values to 0, keeping only a fraction of them, specified as a decimal by `density`). But how they do the sparsification differs:
- TIES drops the weights with the smallest magnitudes.
- DARE drops weights completely at random.
- DELLA also drops weights at random, except it assigns "drop probabilities" to parameters based on their magnitude (so parameters with smaller magnitude are more likely to be dropped). The degree to which magnitude can affect the assigned drop probabilities is controlled by `epsilon` in Mergekit.
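As a rough illustration, the three dropping rules might look like this on a single task-vector tensor (the DELLA version is a simplified approximation of magnitude-ranked drop probabilities, not Mergekit's exact formula):

```python
import torch

def ties_sparsify(tv: torch.Tensor, density: float) -> torch.Tensor:
    # TIES: keep the top `density` fraction of weights by magnitude, zero the rest.
    k = max(1, int(tv.numel() * density))
    mask = torch.zeros_like(tv).flatten()
    mask[tv.abs().flatten().topk(k).indices] = 1.0
    return tv * mask.view_as(tv)

def dare_sparsify(tv: torch.Tensor, density: float) -> torch.Tensor:
    # DARE: keep each weight with probability `density`, uniformly at random.
    mask = torch.bernoulli(torch.full_like(tv, density))
    return tv * mask

def della_sparsify(tv: torch.Tensor, density: float, epsilon: float) -> torch.Tensor:
    # DELLA: random like DARE, but keep-probabilities are shifted by magnitude
    # rank, up to +/- epsilon/2 (larger magnitude -> more likely to survive).
    ranks = tv.abs().flatten().argsort().argsort().float()
    centered = ranks / max(tv.numel() - 1, 1) - 0.5   # in [-0.5, 0.5]
    keep_p = (density + epsilon * centered).clamp(0.0, 1.0)
    return tv * torch.bernoulli(keep_p).view_as(tv)
```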
Each method also has a unique behavior:
- TIES also provides a method for further reducing interference between models, called "sign election" (sketched in code after this list). It works like this:
  - For each parameter, the models are split into two piles: the ones where that parameter is positive, and the ones where it's negative.
  - The absolute values of the parameter in question are totalled for each pile.
  - If the "positive" pile has the larger total, the parameter is elected to be positive, and vice versa.
  - When calculating the actual value of that parameter, only the values from the winning pile are considered.
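A minimal sketch of that election, again on plain tensors rather than Mergekit's internals:

```python
import torch

def elect_signs(task_vectors: list) -> torch.Tensor:
    # Per-parameter sign election: total the magnitudes in each pile,
    # and the sign whose pile has the larger total wins.
    stacked = torch.stack(task_vectors)            # (n_models, ...)
    pos_pile = stacked.clamp(min=0).sum(dim=0)     # totals where positive
    neg_pile = (-stacked).clamp(min=0).sum(dim=0)  # totals where negative
    return torch.where(pos_pile >= neg_pile,
                       torch.ones_like(pos_pile),
                       -torch.ones_like(pos_pile))

def ties_combine(task_vectors: list) -> torch.Tensor:
    # Average each parameter over only the models agreeing with the elected sign.
    stacked = torch.stack(task_vectors)
    sign = elect_signs(task_vectors)
    agrees = (torch.sign(stacked) == sign).float()
    return (stacked * agrees).sum(dim=0) / agrees.sum(dim=0).clamp(min=1)
```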
DARE and DELLA, after sparsifying, multiply the parameters' magnitudes by the inverse of `density` (e.g. `density: 0.5` would mean the magnitudes are multiplied by 2) before doing the merge calculations, then scale them back down afterwards. DELLA, however, lets you specify a final scaling factor (`lambda`) for the merged weights, meaning the final task vector can be made more or less "potent" to taste.
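In sketch form, assuming a simple unweighted average as the combination step (the real merge also applies per-model weights and, for the TIES-style variants, sign election); `lam` stands in for Mergekit's `lambda`:

```python
import torch

def rescale_and_combine(base: torch.Tensor, sparsified_tvs: list,
                        density: float, lam: float = 1.0) -> torch.Tensor:
    # Scale survivors by 1/density so the merge math sees full-strength
    # task vectors despite the zeroed entries.
    rescaled = [tv / density for tv in sparsified_tvs]
    merged_tv = torch.stack(rescaled).mean(dim=0)
    # DELLA's lambda: a final knob on how "potent" the merged task vector is.
    return base + lam * merged_tv
```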
TIES' sign election mechanism can also be used with DARE and DELLA. DARE uses it when the merge method is set to `dare_ties` (as you've done here); DELLA uses it by default, and it's disabled by setting the merge method to `della_linear`.
...Don't ask me how Model Stock works though, I don't think anybody knows the answer to that.
Wow, I wasn't expecting such a big response. Thank you very much. Then all that's left is to dive into the math under the hood of it all to fully understand, but sadly I don't have time for that yet.
> I see, thank you! I'll compare them myself. I was curious because my current project is also a DARE-TIES merge, and while that first version is excellent, it also suffers from some coherence issues at long context. I was wondering if adding the base model as a component might help.
Got it. I tried your model and it's really good. Good job, thanks. It would be great if the context problem could be fixed. I'm more of a fan of your Blackout though, as you might have guessed. I've been trying to combine it with the other two models I like, to get rid of that feeling, when using each of them separately, that the other model did something better. No luck so far, though. I think I'll keep throwing them at the wall until something sticks.
Haha, I appreciate that!!
Blackout needed very little work; it's basically as good as I can get it for my needs without having a friend train up a module for domain knowledge (on which Gemma2 base is already pretty good).
> I've been trying to combine it with the other two models I like, to get rid of that feeling, when using each of them separately, that the other model did something better.
That's usually how it tends to go, yeah. Keep at it; lots of trial and error here!