Almost the same as what I have been planning!
What a coincidence! This is almost the same as what I have been theorising and planning to do: I had the idea that consecutive slices should feature adjacent layers. This should reduce the loss of consistency and act more like a reflection. A sufficiently large undisrupted slice then needs to be at the beginning, to ensure correct parsing and interpretation of the input data, and the same at the end, to ensure a coherent reply is produced as output.
I was thinking of different ways of scaling it (a code sketch of these plans follows the list below):
XXL model (180B - 206 layers)
- initial slice: 10 layers
- intermediate slices: 3 layers
- final slice: 10 layers
- ie: 1-10, 9-11, 10-12, 11-13, ..., 69-71, 70-72, 71-80
XL model (126B - 144 layers)
- initial slice: 10 layers
- intermediate slices: 4 layers
- final slice: 10 layers
- ie: 1-10, 9-12, 11-14, 13-16, ..., 67-70, 69-72, 71-80
L model (107B - 122 layers)
- initial slice: 11 layers
- intermediate slices: 5 layers
- final slice: 11 layers
- ie: 1-11, 10-14, 13-17, 16-20, ..., 64-68, 67-71, 70-80
M model (98B - 112 layers)
- initial slice: 11 layers
- intermediate slices: 6 layers
- final slice: 11 layers
- ie: 1-11, 10-15, 14-19, 18-23, ..., 62-67, 66-71, 70-80
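To make these slice plans concrete, here is a minimal sketch of how the ranges could be generated for an 80-layer base model. The helper name and parameters below are only illustrative, not taken from any actual merge script; every adjacent pair of slices shares 2 layers in all four schemes.

```python
# Illustrative sketch: reproduce the slice plans above for an 80-layer model.
# Adjacent slices share `overlap` layers; intact blocks of `head`/`tail`
# layers sit at the start and end.
def build_slices(total_layers=80, head=10, body=3, tail=10, overlap=2):
    final_start = total_layers - tail + 1
    slices = [(1, head)]                        # undisrupted initial slice
    start = head - overlap + 1                  # first intermediate slice re-uses the last `overlap` head layers
    while start <= final_start - body + overlap:
        slices.append((start, start + body - 1))
        start += body - overlap                 # consecutive slices stay adjacent
    slices.append((final_start, total_layers))  # undisrupted final slice
    return slices

for name, head, body, tail in [("XXL", 10, 3, 10), ("XL", 10, 4, 10),
                               ("L", 11, 5, 11), ("M", 11, 6, 11)]:
    plan = build_slices(head=head, body=body, tail=tail)
    print(name, sum(b - a + 1 for a, b in plan), plan[:2], "...", plan[-2:])
    # totals: 206, 144, 122 and 112 layers, matching the plans above
```

Note the listings above are 1-based and inclusive, while mergekit's layer_range entries are 0-based and half-open (if I recall the convention correctly), so 1-10 would become [0, 10] in an actual passthrough config.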
This gave me some of the best results so far, but I don't think it beats the original. I'm not sure the idea of identifying the most/least important layers through exl2 measurements makes much sense; I could try doing the same for the entire model instead of just some layers.
Here's a test of such a model: https://huggingface.co/llmixer/BigWeave-v28-96b
Thanks, I am now running my own tests, as it is much faster for me to do the merges on my computer than to download the models. Once I have some useful results, I will let you know. Thank you for providing your yaml files too; I will use them to reproduce your most successful merges at home.
Is this still your best model out of all miqu-1-70b self-merges?
And how do you evaluate them?
I would love to see your method and results. Have a look at what I did for froggeric/WestLake-10.7B-v2-GGUF
Previously I used PPL to get a relative comparison between the models, but the calibration set, exl2 version and VRAM amount have changed in the meantime, so I no longer have a good way to compare them. For these last attempts I just tested them manually by talking to them for a bit, but so far the best model I've tested is still a non-frankenmerge (Midnight-Miqu by @sophosympatheia). I've tested miquella, goliath, etc. at 6bpw and they are not as smart as an 8bpw Midnight-Miqu in my opinion. Neither is the 103b version of Midnight-Miqu.
To get the best quality I've made a script to generate a measurements.json that uses the highest bpw quantisation for each layer so there's no need for actual measurements (for 80 layers: https://pastebin.com/Baz6ax1E). It uses a bit more space but it's not significant.
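As a rough sketch of the idea only (not the actual pastebin script; the file layout and key names below are assumptions about a simplified exl2-style measurement format), forcing the highest-bpw option for every layer amounts to writing a measurement file that gives the quantiser a single candidate per layer:

```python
import json

# Sketch only: "measurement", "desc", "bpw" and "accuracy" are assumed key
# names for a simplified layout, not the exact schema the real script emits.
NUM_LAYERS = 80
MAX_BPW_OPTION = {"desc": "8bpw", "bpw": 8.0, "accuracy": 1.0}  # hypothetical single candidate

fake = {"measurement": {}}
for i in range(NUM_LAYERS):
    # one candidate per layer -> the quantiser has nothing to trade off and
    # always ends up on the highest-bpw setting
    fake["measurement"][f"model.layers.{i}"] = [dict(MAX_BPW_OPTION)]

with open("measurement.json", "w") as f:
    json.dump(fake, f, indent=2)
```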
My conclusion so far is that 70b@8bpw > 100+b@6bpw so maybe frankenmerges help mainly at lower bpw.
Hey, @llmixer. I agree with you that the 70B version of Midnight Miqu, even at 5bpw, seems to outperform its 103B version. It's the first time I've felt that way. Before Midnight Miqu, it always felt to me like the 103B version of a model was straight-up better than the 70B version. It could be related to the bpw, but I've tested my 70B models at 5bpw before and this result is just different. My hypothesis is that these merges with miqu need to be handled differently in the frankenmerge process, but I don't know exactly what that looks like right now.
I applaud the work you're doing to seek a solution!
By the way, I released Midnight Miqu v1.5 the other day. I would like to get your opinion on it. Some people seem to think v1.0 was smarter and might be a better generalist, whereas others seem to think v1.5 is an improvement. What do you think?
@sophosympatheia I have added Midnight Miqu v1.5 to my benchmark. I have started running the first few tests, at q8_0, and so far I am impressed. It is the first model I have tested, apart from Claude 3 Opus, that seems to understand humour, and it made me laugh a few times! The scores for the first few tests are great, and if it continues like this, it is likely to end up in the #1 spot in my Creativity benchmark.
Nice! I look forward to hearing more about your results. Thanks for giving it a thorough testing.
@sophosympatheia Thank you! It could well be that miqu is different in that regard. I'll continue to do experiments if something new comes to mind.
I did some (non-systematic) testing with 1.5. It's very similar, but certainly not worse, so that's good :) Maybe slightly better adherence to context, but it could also be within the usual variations when re-generating responses.
Your assessment aligns with my own. v1.5 is very similar to v1.0, but v1.5 seems to pay more attention to context. I released it because it passed a test that relied on contextual awareness that was tripping up every other model including v1.0.
I'd say so, yeah. But of course all of them have been surpassed by new model releases and I still think that merges aren't really superior to the base models. I now just run base models at the maximum possible bpw.
I've done some new experiments with llama3 inspired by @jukofyork (https://huggingface.co/wolfram/miqu-1-120b/discussions/4). They do work but the base model is still better. I'll upload them soon (v31-v33).
Among other things I've used the following simplified variation of the Sally prompt: "Sally has 2 brothers. Each brother has 1 sister. How many sisters does Sally have?"
The base model pretty consistently says zero while the others will more frequently say that she has one sister, herself.
I still think merges have potential to dial down the "excessive positivity", and they never seem in such a rush to get to the ending... Plus their borderline "unhingedness" can also make them more creative (even if 4/5 of the stories make no sense, one good story is all it takes!).
My hope now is to use some of the merges to create "seed stories" for the better models with longer context (eg: command-r-plus) to continue. It seems that if you can get a good "seed story" going, all the positivity crap gets dialled right down; perhaps because the frankenmerges' "seed stories" are so out of line with the prompts they have been aligned with, and/or having a good chunk of context forces the autoregressive prediction to actually continue the story, rather than try to wrap it up with "they all lived happily ever after" type stuff?
I just wish there was a more scientific method to evaluate them :/ I think the most scientific way is probably just to try to find the boundaries of where they break down and then choose settings that sit in the centre of these boundaries... It's a lot easier to judge broken/not-broken than fine nuances IMO.
Yeah I agree with that. It's not an exact science and maybe the right kind of chaos or "defect" can actually be beneficial depending on the application. I value instruction following and precision a lot so model merging is probably not the best approach for me :)
That said, I'm still fascinated by the fact that we can mess around with models that much and they stay coherent.
Yeah, I've had a 100% failure rate with merging coding models (so far), but I did learn a lot by trying, so it was not a totally wasted effort :D
Yeah, it's interesting how badly they break when you mess with the layers that transform the embeddings in/out of the latent space too... Makes me think that there is probably a much better method of creating the embeddings that is yet to be found.