Pythia models fine-tuned with supervised fine-tuning (SFT) and with Direct Preference Optimization (DPO) on the full Anthropic-hh-rlhf dataset for 1 epoch.
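
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, not the training code used for these models. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the actual hyperparameters are not stated here. Inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (hypothetical helper; beta=0.1 is an assumed default)."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin; positive when the policy favors the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the chosen response relative to the reference.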