gx-ai-architect committed on
Commit 8f64d68
1 Parent(s): fa88994

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -18,9 +18,11 @@ base_model: mistralai/Mistral-7B-v0.1
 # Model Card for Merlinite-7B-pt 🔥

 ### Overview
-We introduce a preference-tuned model, **Merlinite-7B-pt**, to the InstructLab model family. **Merlinite-7B-pt** is first supervised-finetuned (SFT) via LAB using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback. Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy. We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and Mt-Bench improvements.
+We introduce **Merlinite-7B-pt**, a strong 7B open-source chat model, aligned using AI feedback **without using any human annotation or proprietary models**.

-The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-7B-Instruct-v0.1, Llama2-70b-chat and comparable to small-sized proprietary models like GPT3.5-Turbo-0314 and Claude-v1, **without using any human annotation or proprietary models**. It also exhibits superior instruction-following and human preference compared to the SFT Merlinite-7B model.
+**Merlinite-7B-pt** is first supervised-finetuned (SFT) via LAB using Mistral-7B-v0.1 as the base model, and then preference-tuned via AI feedback. Our preference-tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as a proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy. We show that DPO log-ratios can serve as a reliable reward signal, with a clear correlation between reward improvements and MT-Bench improvements.
+
+The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistral-7B-Instruct-v0.1 and Llama2-70b-chat, and is comparable to small proprietary models such as GPT3.5-Turbo-0314 and Claude-v1. It also exhibits superior instruction-following and human preference compared to the SFT Merlinite-7B model.

 ### Performance
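
For context on the recipe described in the new Overview text, below is a minimal sketch of how DPO log-ratios can be used as a reward for best-of-N rejection sampling. This is not the authors' released code: the reference model choice, the `beta` value, and all helper names are illustrative assumptions; only the use of Mixtral-8x7B-Instruct-v0.1 as the reward proxy comes from the README.

```python
# Illustrative sketch only: scores candidate responses with the implicit DPO
# reward r(x, y) = beta * [log pi(y|x) - log pi_ref(y|x)] and keeps the best one.
# Model IDs and beta are assumptions, not the authors' configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DPO_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # reward proxy named in the README
REF_MODEL_ID = "mistralai/Mixtral-8x7B-v0.1"           # assumed reference policy

tok = AutoTokenizer.from_pretrained(DPO_MODEL_ID)
dpo_model = AutoModelForCausalLM.from_pretrained(DPO_MODEL_ID, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(REF_MODEL_ID, torch_dtype=torch.bfloat16)

@torch.no_grad()
def response_logprob(model, prompt_ids, response_ids):
    """Sum of log-probs the model assigns to the response tokens given the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]        # position i predicts token i+1
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    resp = logprobs[:, prompt_ids.shape[-1] - 1 :, :]  # keep only response positions
    return resp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

@torch.no_grad()
def dpo_reward(prompt_ids, response_ids, beta=0.1):
    """Implicit DPO reward: the scaled log-ratio between tuned and reference policies."""
    return beta * (
        response_logprob(dpo_model, prompt_ids, response_ids)
        - response_logprob(ref_model, prompt_ids, response_ids)
    )

def best_of_n(prompt, candidates):
    """Rejection-sampling step: keep the candidate with the highest DPO reward."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    scores = []
    for cand in candidates:
        cand_ids = tok(cand, return_tensors="pt", add_special_tokens=False).input_ids
        scores.append(dpo_reward(prompt_ids, cand_ids).item())
    return candidates[scores.index(max(scores))]
```

In the iterative recipe the README describes, the responses selected this way would then serve as finetuning targets for the next round of training the SFT policy.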