gx-ai-architect committed on
Commit 8f64d68
1 Parent(s): fa88994

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -18,9 +18,11 @@ base_model: mistralai/Mistral-7B-v0.1
 # Model Card for Merlinite-7B-pt 🔥

 ### Overview
-We introduce a preference-tuned model, **Merlinite-7B-pt**, to the InstructLab model family. **Merlinite-7B-pt** is first supervised-finetuned (SFT) via LAB using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback. Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy. We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and Mt-Bench improvements.
+We introduce **Merlinite-7B-pt**, a strong 7B open-source chat model, aligned using AI feedback **without using any human annotation or proprietary models**.

-The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-7B-Instruct-v0.1, Llama2-70b-chat and comparable to small-sized proprietary models like GPT3.5-Turbo-0314 and Claude-v1, **without using any human annotation or proprietary models**. It also exhibits superior instruction-following and human preference compared to the SFT Merlinite-7B model.
+**Merlinite-7B-pt** is first supervised-finetuned (SFT) via LAB using Mistral-7B-v0.1 as the base model, and then preference-tuned via AI feedback. Our preference-tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as a proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy. We show that DPO log-ratios can serve as a reliable reward signal, with a clear correlation between reward improvements and MT-Bench improvements.
+
+The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-7B-Instruct-v0.1 and Llama2-70b-chat, and is comparable to small proprietary models such as GPT3.5-Turbo-0314 and Claude-v1. It also exhibits superior instruction-following and human preference compared to the SFT Merlinite-7B model.

 ### Performance
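
For context on the recipe described in the new Overview text, below is a minimal sketch of how DPO log-ratios can be used as a reward for best-of-N rejection sampling. This is not the authors' released code: the reference model choice, the `beta` value, and all helper names are illustrative assumptions; only the use of Mixtral-8x7B-Instruct-v0.1 as the reward proxy comes from the README.

```python
# Illustrative sketch only: scores candidate responses with the implicit DPO
# reward r(x, y) = beta * [log pi(y|x) - log pi_ref(y|x)] and keeps the best one.
# Model IDs and beta are assumptions, not the authors' configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DPO_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # reward proxy named in the README
REF_MODEL_ID = "mistralai/Mixtral-8x7B-v0.1"           # assumed reference policy

tok = AutoTokenizer.from_pretrained(DPO_MODEL_ID)
dpo_model = AutoModelForCausalLM.from_pretrained(DPO_MODEL_ID, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(REF_MODEL_ID, torch_dtype=torch.bfloat16)

@torch.no_grad()
def response_logprob(model, prompt_ids, response_ids):
    """Sum of log-probs the model assigns to the response tokens given the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]        # position i predicts token i+1
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    resp = logprobs[:, prompt_ids.shape[-1] - 1 :, :]  # keep only response positions
    return resp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

@torch.no_grad()
def dpo_reward(prompt_ids, response_ids, beta=0.1):
    """Implicit DPO reward: the scaled log-ratio between tuned and reference policies."""
    return beta * (
        response_logprob(dpo_model, prompt_ids, response_ids)
        - response_logprob(ref_model, prompt_ids, response_ids)
    )

def best_of_n(prompt, candidates):
    """Rejection-sampling step: keep the candidate with the highest DPO reward."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    scores = []
    for cand in candidates:
        cand_ids = tok(cand, return_tensors="pt", add_special_tokens=False).input_ids
        scores.append(dpo_reward(prompt_ids, cand_ids).item())
    return candidates[scores.index(max(scores))]
```

In the iterative recipe the README describes, the responses selected this way would then serve as finetuning targets for the next round of training the SFT policy.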