Couple questions + feedback

#3
by SerialKicked - opened

Hey! For some reason this model appears twice on the open leaderboard with wildly different evals. Is there a reason why? Different versions? I'm mostly asking because of the massive difference in IFEval between the two entries.

Besides that, it's the first good Qwen 2.5 32B fine-tune I've found. I haven't tested it for long, but so far (running Q4_K_M with 16K context; see the sketch below this list):

  • Writing style is better than a Mistral Small's (not exactly a big ask, but still).
  • Reasoning seems very decent, though it's a bit early for me to tell.
  • The instruction refusals I got were all in Chinese, even in a discussion that was 16K tokens' worth of English.
  • Those refusals are rare. They are also very light, disappearing with a very basic system prompt.
  • Does OK at instruction-following and self-prompting tasks (summarizing, RAG, website navigation, API handling), but is still weaker than a Mistral Small given the correct instruct formatting.
  • In RP, it has a tendency to speak for both the user and itself. It's mild and relatively easy to steer away from, but noticeable. This doesn't happen when it takes a more "assistant" role.

All things considered, pretty good.
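For reference, here's a minimal sketch of the setup described above, using the llama-cpp-python bindings. The GGUF filename and the system prompt are placeholders, not anything confirmed in this thread:

```python
# Minimal sketch of the test setup above: a Q4_K_M quant loaded with a
# 16K context window and a very basic system prompt. Uses the
# llama-cpp-python bindings; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="TheBeagle-v2beta-MGS.Q4_K_M.gguf",  # placeholder filename
    n_ctx=16384,           # 16K context, as used in the tests above
    chat_format="chatml",  # the model's instruct format (see below)
)

response = llm.create_chat_completion(
    messages=[
        # Per the notes above, even a very basic system prompt like this
        # is enough to suppress the rare Chinese-language refusals.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the following text: ..."},
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```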

The only difference is the "use chat template" flag on the evaluation submissions. It's the same model. I also find the IFEval results very interesting; in our experiments, IFEval gets impacted easily. There must be something not entirely disclosed about how the foundation model was trained.

I assume that means one entry was tested with the ChatML instruct format and the other without (raw inputs)? If you got worse IFEval results with ChatML enabled, that would be a very funky result indeed.
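For anyone unfamiliar with the flag, here's a sketch of what it toggles, assuming the leaderboard applies the tokenizer's built-in template via the standard Transformers call; with the flag off, the prompt would be sent as raw text instead. The repo id is assumed from the GGUF link later in this thread:

```python
# Sketch of the difference the "use chat template" flag makes.
# With the flag on, the prompt is wrapped in the model's instruct
# format (ChatML here); with it off, the bare text is sent as-is.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fblgit/TheBeagle-v2beta-MGS")  # assumed repo id

messages = [{"role": "user", "content": "List three prime numbers."}]

# Flag on: apply the tokenizer's chat template.
templated = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(templated)
# e.g. "<|im_start|>user\nList three prime numbers.<|im_end|>\n<|im_start|>assistant\n"

# Flag off: the bare prompt, with no role markers at all.
raw = messages[0]["content"]
```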

In any case, there's likely something weird with the base model; at least, that seems to be the opinion of several people I've talked to: it's apparently difficult to get decent/consistent results from fine-tuning it.

Thanks for the clarification.

I just fixed the chat template, my friends. Pull it again and you'll see :)

I'm using a GGUF and my own private front-end (think SillyTavern, but a lot more powerful), so I don't apply the config's template automatically (yet).
From what I can read, it's standard ChatML with additional function-calling tags. I didn't know it had function calling; I do need that feature, so that's nice.

Are you sure about replacing the <|im_end|> "eos_token" with <|endoftext|>? That seems wrong.
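A quick way to check which stop token the config actually ships, sketched with Transformers (the repo id is assumed; ChatML-formatted models normally end turns with <|im_end|>):

```python
# Sketch: inspect the tokenizer config to see which EOS token is set.
# ChatML models normally stop on <|im_end|>; if the config declares
# <|endoftext|> instead, generation may run past the end of a turn.
# Repo id is assumed from the GGUF link in this thread.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fblgit/TheBeagle-v2beta-MGS")
print("eos_token:", tok.eos_token)
print("template tail:", tok.chat_template[-80:] if tok.chat_template else None)
```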

I ran the evals: it gets +1.85 on MATH and +26 on IFEval. The quants are still not yet updated.

Welp, you know best :D I'll check the fixed GGUFs whenever you have them ready.

Uploaded the new GGUFs; you can find them here:
https://huggingface.co/fblgit/TheBeagle-v2beta-MGS-GGUF
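For anyone grabbing them programmatically, a sketch with `hf_hub_download` from huggingface_hub; the exact quant filenames aren't given in this thread, so the one below is a placeholder:

```python
# Sketch: download one quant from the GGUF repo linked above.
# The filename is a placeholder -- check the repo's file list for
# the actual quant names (e.g. Q4_K_M).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="fblgit/TheBeagle-v2beta-MGS-GGUF",
    filename="TheBeagle-v2beta-MGS.Q4_K_M.gguf",  # placeholder name
)
print("Saved to:", path)
```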
