Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -60,7 +60,7 @@ Instead of training preference models or prompting large language models (LLMs)
 
  We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO-policy and the base-policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
 
- Having Mixtral log-ratio as reward model, we then choose iterative rejection sampling fine-tuning as the RL alignment method. For each prompt, we sample \( N \) times from the current optimal policy (starting from the SFT model). We then query the preference reward and select the highest scoring sample as the target. The initial policy is updated through supervised fine-tuning based on the outputs of rejection sampling. This process is iterated by conducting additional rounds of best-of-N sampling followed by SFT training.
+ Using the Mixtral logprob-ratio as the reward model, we then chose iterative rejection-sampling fine-tuning as the RL alignment method. For each prompt, we sample \( N \) times from the current policy (starting from the SFT model), query the preference reward, and select the highest-scoring sample as the target. The policy is then updated through supervised fine-tuning on the rejection-sampled outputs, and the process is iterated with additional rounds of best-of-N sampling followed by SFT training.
 
  The prompts for preference tuning were uniformly sampled by source from the [LAB](https://arxiv.org/abs/2403.01081) SFT data distribution, which has extensive coverage of knowledge, domains, and tasks.
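
For reference, the reward described above is the DPO-style implicit reward: the log-probability of a response under Mixtral-8x7B-Instruct-v0.1 minus its log-probability under Mixtral-8x7B-v0.1. Below is a minimal sketch of that computation with Hugging Face `transformers`; the model IDs come from the text, while the `beta` scale, the helper names, the commented-out loading lines, and the omission of the instruct model's chat template are simplifying assumptions rather than details taken from this README.

```python
# Sketch: score a (prompt, response) pair with the Mixtral logprob-ratio,
# reward(x, y) = beta * (log p_instruct(y | x) - log p_base(y | x)).
# Model IDs are from the text above; everything else is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

INSTRUCT_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
BASE_ID = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(INSTRUCT_ID)
# instruct_model = AutoModelForCausalLM.from_pretrained(INSTRUCT_ID, torch_dtype=torch.bfloat16, device_map="auto")
# base_model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")


@torch.no_grad()
def response_logprob(model, prompt_ids, response_ids):
    """Sum of log p(response tokens | prompt and earlier response tokens)."""
    input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0).to(model.device)
    logits = model(input_ids).logits[0]                  # (seq_len, vocab)
    logprobs = torch.log_softmax(logits[:-1], dim=-1)    # logits at position t predict token t + 1
    token_lp = logprobs.gather(-1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[prompt_ids.shape[-1] - 1:].sum()     # keep only the response positions


def logratio_reward(prompt, response, instruct_model, base_model, beta=1.0):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]
    response_ids = tokenizer(response, add_special_tokens=False, return_tensors="pt").input_ids[0]
    lp_instruct = response_logprob(instruct_model, prompt_ids, response_ids)
    lp_base = response_logprob(base_model, prompt_ids, response_ids)
    return beta * (lp_instruct - lp_base).item()
```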
 
@@ -70,9 +70,9 @@ The prompts space for preference tuning were uniformly sampled by source from th
 
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/Vt0eldYNUW1vOpLBd-_DI.png" width="650">
 
- The preference tuned version of Merlinite-7B-pt shows overall all performance enhancement across the board, with no alignment tax observed, as shown in our evaluation. Surprisingly, we find improvements in mathematical ability measured by GSM8K and MT-Bench, which differs from studies observing decreased math/reasoning after RLHF alignment.
+ The preference-tuned Merlinite-7B-pt shows performance improvements across the board, with no alignment tax observed in our evaluation. Surprisingly, we find improvements in mathematical ability as measured by GSM8K and MT-Bench, which differs from studies that observe decreased math/reasoning ability after RLHF alignment.
 
- We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as dhown in chart above. The reward score of Best-of-N sampled batch keeps improving til Rejection Sampling Round-2. Model saturates at Rejection sampling round 3, no longer gives improvements on either MT-Bench nor Mixtral-DPO rewards.
+ We also observe a clear correlation between the Mixtral-DPO reward scores and MT-Bench scores, as shown in the chart above. The reward score of the best-of-N sampled batch keeps improving through rejection-sampling round 2. The model saturates at rejection-sampling round 3, yielding no further improvement on either MT-Bench or the Mixtral-DPO reward.
 
  The final Merlinite-7B-pt is the peak checkpoint measured by both Batch-Reward and MT-Bench.
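
To make the round structure above concrete, here is a small sketch of the iterative best-of-N loop, tracking the per-round batch reward used to pick the peak checkpoint. The helpers `sample`, `reward`, and `sft_train` and the defaults `n=8` and `rounds=3` are hypothetical placeholders, not APIs or settings from this repository (the README does not state the value of N).

```python
# Sketch of iterative rejection-sampling (best-of-N) fine-tuning.
# `sample`, `reward`, and `sft_train` are hypothetical callables supplied by the caller.
def rejection_sampling_finetune(policy, prompts, sample, reward, sft_train, n=8, rounds=3):
    batch_reward_history = []                        # mean best-of-N reward per round
    for _ in range(rounds):
        targets, best_scores = [], []
        for prompt in prompts:
            candidates = sample(policy, prompt, n)   # N samples from the current policy
            scores = [reward(prompt, c) for c in candidates]
            best_idx = max(range(len(candidates)), key=scores.__getitem__)
            targets.append((prompt, candidates[best_idx]))   # highest-scoring sample becomes the SFT target
            best_scores.append(scores[best_idx])
        policy = sft_train(policy, targets)          # SFT on the rejection-sampled outputs
        batch_reward_history.append(sum(best_scores) / len(best_scores))
    # Per the README, the released Merlinite-7B-pt is the checkpoint with the peak
    # batch reward and MT-Bench score across rounds.
    return policy, batch_reward_history
```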
 
 