killawhale2 committed
Commit 2b079b2
1 Parent(s): 6625d6d

add fine-tuning dataset details

Files changed (1)
  1. README.md +34 -3
README.md CHANGED
@@ -1,5 +1,12 @@
---
license: apache-2.0
+ datasets:
+ - c-s-ale/alpaca-gpt4-data
+ - Open-Orca/OpenOrca
+ - Intel/orca_dpo_pairs
+ - allenai/ultrafeedback_binarized_cleaned
+ language:
+ - en
---

# **Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!**
@@ -19,11 +26,35 @@ Solar 10.7B is an ideal choice for fine-tuning. SOLAR-10.7B offers robustness an
# **Instruction Fine-Tuning Strategy**

We utilize state-of-the-art instruction fine-tuning methods including supervised fine-tuning (SFT) and direct preference optimization (DPO) [1].
- Using open source datasets with Alpaca- and OpenOrca-style and generated synthetic datasets, we apply iterative DPO training, a proprietary alignment strategy, to maximize the performance of our resulting model.
- *Note:* We were careful of data contamination during SFT and DPO, e.g., removing data created using TruthfulQA's prompts.
- [1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D. and Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

+ We used a mixture of the following datasets:
+ - c-s-ale/alpaca-gpt4-data (SFT)
+ - Open-Orca/OpenOrca (SFT)
+ - in-house generated data utilizing MetaMath [2] (SFT, DPO)
+ - Intel/orca_dpo_pairs (DPO)
+ - allenai/ultrafeedback_binarized_cleaned (DPO)
+
+ To guard against data contamination, we did not use GSM8K samples when generating data, and we filtered out tasks from the following list where applicable.
+ ```python
+ filtering_task_list = [
+     'task228_arc_answer_generation_easy',
+     'ai2_arc/ARC-Challenge:1.0.0',
+     'ai2_arc/ARC-Easy:1.0.0',
+     'task229_arc_answer_generation_hard',
+     'hellaswag:1.1.0',
+     'task1389_hellaswag_completion',
+     'cot_gsm8k',
+     'cot_gsm8k_ii',
+     'drop:2.0.0',
+     'winogrande:1.1.0'
+ ]
+ ```
+
+ Using the datasets mentioned above, we apply SFT and iterative DPO training, a proprietary alignment strategy, to maximize the performance of our resulting model.
+
+ [1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D. and Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
+
+ [2] Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J.T., Li, Z., Weller, A. and Liu, W., 2023. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.

# **Evaluation Results**
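
The four public corpora named in the new `datasets:` metadata and in the bullet list above are all hosted on the Hugging Face Hub. As an illustration only (this is not Upstage's training code), the sketch below loads each one and reports its splits, grouped by the stage the card assigns it to; split and column layouts differ across the four, so it inspects rather than merges them, and the in-house MetaMath-style data is not public.

```python
# A minimal sketch, not from the commit: load the four public datasets named
# in the card and report their splits per training stage.
from datasets import load_dataset

SFT_SOURCES = ["c-s-ale/alpaca-gpt4-data", "Open-Orca/OpenOrca"]
DPO_SOURCES = ["Intel/orca_dpo_pairs", "allenai/ultrafeedback_binarized_cleaned"]

for stage, sources in (("SFT", SFT_SOURCES), ("DPO", DPO_SOURCES)):
    for name in sources:
        ds = load_dataset(name)  # DatasetDict keyed by split name
        sizes = {split: len(subset) for split, subset in ds.items()}
        print(f"{stage:>3} | {name}: {sizes}")
```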
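
The card publishes the benchmark task list but not the filtering code itself. Below is an assumed sketch of how such a list could be applied to a FLAN-style instruction collection in which every record carries a `task` field; the file name `sft_candidates.jsonl` and the `task` column are hypothetical, not something the card specifies.

```python
# Hypothetical decontamination pass: drop any training example whose source
# task appears in the published filtering_task_list. The input file and the
# "task" field are assumptions; the card does not show this step's code.
from datasets import load_dataset

filtering_task_list = [
    'task228_arc_answer_generation_easy',
    'ai2_arc/ARC-Challenge:1.0.0',
    'ai2_arc/ARC-Easy:1.0.0',
    'task229_arc_answer_generation_hard',
    'hellaswag:1.1.0',
    'task1389_hellaswag_completion',
    'cot_gsm8k',
    'cot_gsm8k_ii',
    'drop:2.0.0',
    'winogrande:1.1.0',
]
blocked_tasks = set(filtering_task_list)

def is_clean(example):
    # Keep the example only if its originating task is not a held-out benchmark.
    return example.get("task", "") not in blocked_tasks

candidates = load_dataset("json", data_files="sft_candidates.jsonl", split="train")
cleaned = candidates.filter(is_clean)
print(f"kept {len(cleaned)} of {len(candidates)} examples")
```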
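
Reference [1] defines the DPO objective behind the SFT-then-DPO recipe above. As a compact reminder, here is a generic PyTorch sketch of that published loss; the iterative schedule the card mentions is described only as proprietary, so it is not reproduced.

```python
# Generic DPO loss from Rafailov et al. [1]: widen the margin between the
# implicit rewards of the chosen and rejected responses. Inputs are summed
# token log-probabilities per response; beta scales the implicit KL penalty.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(x)) == softplus(-x); average over the preference-pair batch.
    return F.softplus(-(chosen_reward - rejected_reward)).mean()

# Toy batch of 4 preference pairs with random log-probabilities.
print(dpo_loss(*[torch.randn(4) for _ in range(4)]))
```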