πŸŽ‰ Congrats πŸŽ‰

This is a huge jump in performance!

What was the difference from the other Calme models, given that they all use the same dataset?

Thank you. Honestly, I didn't expect this one to score this high. I ran 7 experiments and started making them public for evaluation one after another, so 2.3 and 2.4 were both under evaluation.

I had a quick look now to see what differentiates this one from the other two that we know (2.1 and 2.2):

  • I used MaziyarPanahi/calme-2.1-rys-78b as a base to fine-tune (since it had higher MMLU PRO and GPQA scores on average).
  • I can see this one went on for a long time! Compared to others which usually run for less than 1000 steps, this was trained for more than 3300 steps.
  • The dataset is a mix of what I usually use, like TruthfulQA and Orca, but I can see that for the first time, I used my own synthetically generated dataset that I thought would help with Chain of Thought and multi-step reasoning. (Something I did for LegalKit CoT; I thought maybe I could make DPO datasets.)
  • At the same time, there is another DPO dataset that tries to improve MMLU by introducing diverse multi-task understanding, and I used CLAIR to finalize the DPO (https://github.com/ContextualAI/CLAIR_and_APO).

This is my quick assessment. I will stop submitting any new experiments and instead go into the details to make sure everything here is above board. I am happy to see that the CoT and multi-step reasoning DPO datasets were successful, but I will dig into the datasets to be sure the model actually improved and didn't just learn how to answer certain questions.

Will come back with more once I find something interesting.

UPDATE: pinning the post so people can follow any updates

How did you achieve such high scores on MMLU Pro? It would seem quite impossible without training on the set itself; amazing if you didn't do that.

Have you tried embedding both the new DPO dataset you used and the MMLU Pro dataset, and checking the most similar items across the two sets?

Maybe by eye you can judge whether there was a knowledge leak.
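
Something like the sketch below, assuming sentence-transformers for the embeddings and that MMLU-Pro is the TIGER-Lab/MMLU-Pro repo on the Hub; the DPO dataset path and its "prompt" column are placeholders since I don't know your schema.

```python
# Embed both question sets and eyeball the highest cross-set cosine similarities.
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

mmlu_questions = load_dataset("TIGER-Lab/MMLU-Pro", split="test")["question"]
dpo_prompts = load_dataset("path/to/your-dpo-dataset", split="train")["prompt"]  # placeholder path/column

encoder = SentenceTransformer("all-MiniLM-L6-v2")
mmlu_emb = encoder.encode(mmlu_questions, normalize_embeddings=True)
dpo_emb = encoder.encode(dpo_prompts, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized embeddings.
sims = dpo_emb @ mmlu_emb.T

# Print the 20 most similar cross-set pairs for manual inspection.
flat_top = np.argsort(sims, axis=None)[::-1][:20]
for i, j in zip(*np.unravel_index(flat_top, sims.shape)):
    print(f"{sims[i, j]:.3f}")
    print(f"  DPO : {dpo_prompts[i][:120]}")
    print(f"  MMLU: {mmlu_questions[j][:120]}")
```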

There is absolutely no similarity or contamination. I verified this even before the training by reviewing the questions. However, please note that the evaluation datasets are in SFT format, while I generated my own DPO/ORPO datasets from scratch for post-training and alignment. The likelihood of knowledge leakage is quite low.
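
To make the format point concrete, here is a toy illustration (the field names and contents are made up, not my actual schema): an MMLU-Pro-style eval item is a multiple-choice SFT prompt with a single gold letter, while my DPO/ORPO records pair a prompt with a preferred and a dispreferred free-form response.

```python
# Toy illustration of the two formats (contents are invented for the example).
sft_eval_item = {
    "question": "Which law relates current, voltage, and resistance?",
    "options": ["A. Ohm's law", "B. Hooke's law", "C. Boyle's law", "D. Coulomb's law"],
    "answer": "A",
}

dpo_record = {
    "prompt": "Explain step by step how to find the current through a 10 ohm resistor across 5 V.",
    "chosen": "Ohm's law gives I = V / R, so I = 5 V / 10 ohm = 0.5 A.",
    "rejected": "The current is 50 A, because you multiply the voltage by the resistance.",
}
```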

That said, whether this process was fair is something I cannot determine at this moment. Specifically, I am unsure if it improved the model’s general ability to answer questions or only enhanced its capacity to handle multi-choice, multi-step reasoning questions. The model is large, making it challenging to conduct comprehensive evaluations.

I don't doubt that you created the DPO datasets from scratch, but were they generated synthetically using a private LLM, or hand-made by humans? I was wondering whether the MMLU Pro dataset was itself potentially created with GPT-4 or Claude; that would leak data, since both would then come from the same source.

You are right about multi-step thinking, though. I'm not sure it's fair to judge a one-shot model against something like OpenAI's o1 models. The o1 models can try repeatedly to answer a question and select the best result when they think they're finally done. Not really an apples-to-apples comparison.

I always use local LLMs to generate datasets; I have no budget to spend on private models, and they are not transparent. This dataset was made with a mix of three local models, all above 70B in size.
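
Roughly, the setup looks like the sketch below: the models sit behind local OpenAI-compatible servers (vLLM, llama.cpp server, etc.) and prompts are rotated across them. The model names, ports, and record schema are placeholders, not my actual configuration.

```python
# Rotate prompts across several local models served behind OpenAI-compatible
# endpoints; names, ports, and schema below are placeholders.
import itertools
import requests

LOCAL_MODELS = [
    ("local-72b-instruct-a", "http://localhost:8001/v1/chat/completions"),
    ("local-70b-instruct-b", "http://localhost:8002/v1/chat/completions"),
    ("local-70b-instruct-c", "http://localhost:8003/v1/chat/completions"),
]

def generate_mixed(prompts: list[str]) -> list[dict]:
    """Round-robin each prompt to a different local model to diversify the generations."""
    records = []
    for prompt, (model, url) in zip(prompts, itertools.cycle(LOCAL_MODELS)):
        resp = requests.post(
            url,
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=600,
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        records.append({"prompt": prompt, "response": answer, "generator": model})
    return records
```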
