Non-lasered version should be better

#1
by froggeric - opened

Currently benchmarking the GGUF q4_km version and so far this looks very promising, probably even better than the original Westlake v2. I will update you once I finish.

However, I have already benchmarked the original, the laser version and the laser truthy dpo. The original is the best by far! Here are my benchmark scores (24 tests, 50% sfw, 50% nsfw):

WestLake-7B-v2-GGUF
Total: 77 (sfw: 38 + nsfw: 39)

WestLake-7B-v2-laser-GGUF
Total: 67 (sfw: 34 + nsfw: 33)

WestLake-7B-v2-laser-truthy-dpo-GGUF
Total: 65 (sfw: 37 + nsfw: 28)

I have definitely noticed a big quality difference between them. And something I was not expecting at all: Westlake v2 is the top scorer out of the 50 models, from 7b to 20b, in my benchmark! This includes PsyMedRP v1 20b (2nd place), LlaMa2-13b-Estopia (4th place), the noromaids, etc.

If you already have a non-lasered version with the original Westlake, I would be very interested to test it (in GGUF format please). I expect it should be even better than this one.

I do not have a non-lasered version right this second, but give me ~1 hour and I’ll ping you with the link to it. It will take hardly any time.

@froggeric
Sorry I took so long, I had some synthetic data generating that took up all of my GPU space for longer than it should have.

But I made you 2:

  • This is a merge that I think will be good, maybe better. Definitely more NSFW.

  • This is the model you asked for, just not lasered.

Thank you. Both of your links point to the same model WestLake 2x7B, but I assume for the first one you mean KunoichiLake 2x7B. What is the difference? Kunoichi as the base?

Anyway, I have run a few tests, 10 so far, on the 4 models, benchmarking them comparatively (i.e. a score of 1 to 4 for each test, reflecting the rank from best to worst, 1 being the best). Here are the results (lower score = better):

  • 1st place (20 points): original Westlake v2 7B. Ranked 1st 4 times, never 4th (last). Most consistent at following instructions to the letter while producing good answers/text.
  • 2nd place (21 points): WestLake 2x7B. Ranked 1st 3 times, and 4th 2 times. Usually gives output close to the original, but has slightly more difficulty following instructions, and text does not feel as well written as the original.
  • 3rd place (26 points): KunoichiLake 2x7B. Ranked 1st 3 times, 4th 3 times. Writes some great text, which I think is better than the original model's, and adds more detail. Unfortunately, it is let down by not following instructions as well, and sometimes erring on the nonsensical side.
  • 4th place (32 points): Laser WestLake 2x7B. Never ranked 1st, ranked 4th 4 times. Worse than the others in every aspect.
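
For anyone who wants to reproduce this kind of comparative scoring, here is a minimal sketch of the rank-sum scheme described above: each test ranks the models from 1 (best) upward, and the ranks are summed per model. The function name `rank_sum`, the model names, and the ranks in the example are all made up for illustration; they are not my actual benchmark data.

```python
from collections import defaultdict


def rank_sum(rankings):
    """Aggregate per-test ranks into a leaderboard.

    rankings: list of dicts, one per test, mapping model name -> rank
    (1 = best). Returns a list of (model, total, times_first, times_last)
    tuples sorted by total, lowest (best) first.
    """
    totals = defaultdict(int)
    firsts = defaultdict(int)
    lasts = defaultdict(int)
    for test in rankings:
        worst = max(test.values())  # the last-place rank in this test
        for model, rank in test.items():
            totals[model] += rank
            if rank == 1:
                firsts[model] += 1
            if rank == worst:
                lasts[model] += 1
    return sorted(
        ((m, totals[m], firsts[m], lasts[m]) for m in totals),
        key=lambda row: row[1],
    )


# Hypothetical ranks for three placeholder models over three tests.
example_tests = [
    {"model_a": 1, "model_b": 2, "model_c": 3},
    {"model_a": 2, "model_b": 1, "model_c": 3},
    {"model_a": 1, "model_b": 3, "model_c": 2},
]

leaderboard = rank_sum(example_tests)
for model, total, n_first, n_last in leaderboard:
    print(f"{model}: {total} points, 1st {n_first}x, last {n_last}x")
```

Counting first and last places next to the total helps separate a consistent model from one that alternates between best and worst, which the total alone can hide.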

Overall I think the only model worth keeping, and maybe improving, is the KunoichiLake. Both of the others produce results close to the original, but usually worse, and slower due to the increased complexity of MoE. However, maybe the comparison between the original Kunoichi and KunoichiLake would produce similar results; I have not tried it.

I think you should test KunoichiLake against Kunoichi to see more improvements. KunoichiLake is geared more towards roleplay activities. I believe that Kunoichi could easily be improved with some more prompt alignment.

Thank you for the testing! Very interesting results.

I have now tested KunoichiLake against the 3 Kunoichi variants (original, dpo, dpo v2). The non-MoE versions are again better.

The concept of MoE sounds promising, but it seems difficult to deliver good results. To me the problem seems to be that it does not consistently take the best of all models. It does not take the worst, which is good, but it seems difficult to properly tune the decision making as to which model to use, and when.

Maybe it is better to use MoE with a group of highly specialised experts, each dedicated to a distinct area of knowledge. This would make the decision making easier and clearer. When using models targeted toward similar tasks, how can it decide which model gives the best answer?
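
To make the routing question concrete, here is a rough sketch of generic top-k softmax gating. This is an assumption for illustration only: I am not claiming the merged models in this thread are routed exactly this way, and the function names and router scores below are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-probability experts and renormalise their weights.

    Returns a dict mapping expert index -> mixing weight (weights sum to 1).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


# Invented router scores: distinct specialists give one clear winner...
specialised = route_top_k([4.0, 0.1, 0.0, -0.2])
# ...while experts trained for similar tasks score almost the same.
similar = route_top_k([1.0, 0.9, 1.1, 0.95])

print("specialised experts:", specialised)
print("similar experts:   ", similar)
```

With clearly separated scores, nearly all of the weight goes to one expert. With near-identical scores the split is close to even, so small noise in the router can flip which expert dominates, which matches the inconsistency I am seeing.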

Well so what you’re describing is why I moved on to just fine tuning 7B models or SOLAR rather than MoE.

I still like to tinker with clown MoE because they’re creative and fun but there’s definitely still some work to be done.

As for the bit about highly specialized experts. Check out my “polyglot” collection. I explored the idea of highly specialized experts and it had fairly decent results.

Those models are way less about performing well and way more proof of concept. I wouldn’t even benchmark them, but they do work surprisingly well in up to 8 languages
