gx-ai-architect/merlinite-placeholder

@@ -18,6 +18,7 @@ base_model: mistralai/Mistral-7B-v0.1
 # Model Card for Merlinite-7B-pt 🔥
 ### Overview
 **Merlinite-7B-pt** is first supervised-finetuned (SFT) via LAB using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback. Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy. We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and Mt-Bench improvements.
 The final **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-7B-Instruct-v0.1, Llama2-70b-chat and comparable to small-sized proprietary models like GPT3.5-Turbo-0314 and Claude-v1, **without using any human annotation or proprietary models**. It also exhibits superior instruction-following and human preference compared to the SFT Merlinite-7B model.
@@ -57,7 +58,7 @@ We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for compu
 Having Mixtral log-ratio as reward model, we then choose iterative rejection sampling fine-tuning as the RL alignment method. For each prompt, we sample \( N \) times from the current optimal policy (starting from the SFT model). We then query the preference reward and select the highest scoring sample as the target. The initial policy is updated through supervised fine-tuning based on the outputs of rejection sampling. This process is iterated by conducting additional rounds of best-of-N sampling followed by SFT training.
-The prompts space for preference tuning were uniformly sampled by source from the LAB SFT data distribution, which has extensive coverage in knowledge, domains, and tasks.
+The prompts space for preference tuning were uniformly sampled by source from the [LAB](https://arxiv.org/abs/2403.01081) SFT data distribution, which has extensive coverage in knowledge, domains, and tasks.
 ### Discussion
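
For readers of this diff, the iterative rejection sampling described in the context paragraph ("Having Mixtral log-ratio as reward model...") can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' training code. It assumes the reward is a DPO-style log-probability difference between the instruct and base Mixtral models, which the truncated hunk header suggests but does not spell out. The names `log_ratio_reward`, `best_of_n`, `rejection_sampling_rounds`, `sft_update`, and the `beta` scale are all hypothetical, and the policy is modelled as a plain callable that returns one sampled response per call.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch.
Policy = Callable[[str], str]          # prompt -> one sampled response
Reward = Callable[[str, str], float]   # (prompt, response) -> scalar score
SftUpdate = Callable[[Policy, List[Tuple[str, str]]], Policy]


def log_ratio_reward(logp_instruct: float, logp_base: float, beta: float = 1.0) -> float:
    """DPO-style implicit reward from two sequence log-probabilities.

    Assumption: logp_instruct and logp_base are log pi(y|x) under the
    instruct and base models; beta is an illustrative scale factor.
    """
    return beta * (logp_instruct - logp_base)


def best_of_n(prompt: str, policy: Policy, reward: Reward, n: int) -> str:
    """Sample n responses from the policy and keep the highest-scoring one."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


def rejection_sampling_rounds(
    policy: Policy,
    prompts: List[str],
    reward: Reward,
    sft_update: SftUpdate,
    n: int,
    rounds: int,
) -> Policy:
    """Alternate best-of-N target selection with a supervised update."""
    for _ in range(rounds):
        # Build (prompt, best response) pairs from the current policy.
        targets = [(p, best_of_n(p, policy, reward, n)) for p in prompts]
        # sft_update stands in for a full SFT run on those pairs.
        policy = sft_update(policy, targets)
    return policy
```

In a real setup, `policy` would wrap generation from the current checkpoint, `reward` would score each response with the two Mixtral models, and `sft_update` would fine-tune on the selected pairs and return the new checkpoint's sampler.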