@BramVanroy on Hugging Face: "🔎 DPO hyperparameter search update! In my previous post…"

Join the conversation

Join the community of Machine Learners and AI enthusiasts.

BramVanroy

posted an update Jan 29

Post

🔎 DPO hyperparameter search update!

In my previous post (https://huggingface.co/posts/BramVanroy/633544255876795), I indicated how despite high reward accuracies and low losses, my model would sometimes just output repeating random tokens (/*****/). There were some useful brainstorms in that thread. I think the dataset is relatively easy for the model, leading it to quickly overfit when the beta is very small, which allows the model to step away further from its initially outputs.

So, I ran a hyperparameter search for learning rate (1e-7 v 5e-7), batch size (32, 64, 96, 128) and most importantly, beta (0.01, 0.1, 0.2, 0.5). You can have a look at the results for yourself here: https://wandb.ai/bramvanroy/dpo-geitje-ultra-hyperparams

Interpreting the result, I'd think that the beta=0.5 is the better choice for this dataset. Reasons:

- markedly higher rewards margins compared to all other betas
- better balance between positive chosen and negative rejected rewards
- log probabilities are not as superbly low as for beta=0.01, which seems too low for this dataset

Of course, that is just purely looking at numbers without running any benchmarks. However, I am hesitant to evaluate all the models on benchmarks and, therefore, literally optimising my hyperparameters on a test set (which is very bad!). So I will just play with some of the most promising models and see which one feels "best" qualitatively.

If you have other insights, thoughts, or opinions, let me know!

davanstrien

Jan 29

Excellent work! Would be great if someone did a big sweep across a bunch of datasets and parameters to produce guidelines based on dataset properties/model size, etc.

BramVanroy

Jan 31

From my limited experience, looking at potential unwanted text properties of chosen v. rejected can be another one to investigate. I think the model may learn, for instance, that longer sequences are better and will therefore learn to generate longer sequences regardless of the quality of the content. You can catch this in the metrics, I believe, but only in the log probs (which would then also be very low for the chosen text, perhaps even lower than the rejected text). You'll likely not notice this in the rewards metrics.

BramVanroy

Jan 31

Update (since my post can't be too long): it seems that beta=0.5 was too high! Looking at the log probabilities, you'll notice that despite high rewards margins the log probabilities of all beta=0.5's are inconsistent (chosen log probs are lower than the rejected ones!). I am not sure what causes this. Perhaps the model has become good at discerning the two texts on other characteristics (like text length) but isn't really certain about the probabilities? In any case, I think that 0.1 beta is better. Perhaps there is some middle ground that I missed during grid search but looking at the results qualitatively, it seems that 0.1 is a good model.

In this post