killawhale2 committed
Commit: 6625d6d
Parent(s): 00ae028
Update README.md
README.md CHANGED
@@ -21,6 +21,8 @@ Solar 10.7B is an ideal choice for fine-tuning. SOLAR-10.7B offers robustness an
 We utilize state-of-the-art instruction fine-tuning methods including supervised fine-tuning (SFT) and direct preference optimization (DPO) [1].
 Using open source datasets with Alpaca- and OpenOrca-style and generated synthetic datasets, we apply iterative DPO training, a proprietary alignment strategy, to maximize the performance of our resulting model.

+*Note:* We were careful of data contamination during SFT and DPO, e.g., removing data created using TruthfulQA's prompts.
+
 [1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D. and Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

 # **Evaluation Results**
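The context lines above describe SFT followed by iterative DPO training. As a rough illustration of what a single DPO round looks like in practice, here is a minimal sketch using Hugging Face's TRL library; the starting checkpoint, the toy preference pairs, and the hyperparameters are placeholders, Upstage's proprietary iterative alignment strategy is not reproduced, and exact argument names vary across TRL versions.

```python
# Minimal sketch of one DPO training round with TRL (illustrative only; names,
# data, and hyperparameters are placeholders, not Upstage's actual setup).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "upstage/SOLAR-10.7B-v1.0"  # assumed starting checkpoint after SFT
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO consumes preference pairs: a prompt plus a preferred ("chosen") and a
# dispreferred ("rejected") completion.
preference_pairs = Dataset.from_dict({
    "prompt": ["Summarize what DPO does."],
    "chosen": ["DPO optimizes the policy directly on preference pairs, without training a separate reward model."],
    "rejected": ["DPO is a tokenizer setting."],
})

config = DPOConfig(output_dir="solar-dpo-round1", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_pairs,
    processing_class=tokenizer,  # older TRL releases call this argument `tokenizer`
)
trainer.train()
```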
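The added note says data created using TruthfulQA's prompts was removed to avoid contamination. The sketch below shows one simple way such filtering could be done, by exact-matching training prompts against TruthfulQA questions; the actual matching criteria used for SOLAR are not described in this commit, and the training file name and `prompt` column are assumptions.

```python
# Illustrative decontamination pass: drop training examples whose prompt exactly
# matches a TruthfulQA question. The real filtering rules are not documented here;
# the training file name and the "prompt" column are assumptions for this sketch.
from datasets import load_dataset

truthfulqa_questions = {
    q.strip().lower()
    for q in load_dataset("truthful_qa", "generation", split="validation")["question"]
}

def is_contaminated(example):
    return example["prompt"].strip().lower() in truthfulqa_questions

train_data = load_dataset("json", data_files="sft_data.json", split="train")
clean_train_data = train_data.filter(lambda example: not is_contaminated(example))
print(f"kept {len(clean_train_data)} of {len(train_data)} examples")
```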