mtasic85 committed
Commit fd94ec3
1 Parent(s): c162f45

pretrain eval

Files changed (1)
  1. README.md +286 -13
README.md CHANGED
@@ -44,44 +44,317 @@ This model **isn't** designed for immediate use but rather for Continued Pretraining

 The objective is to streamline the cognitive or reasoning core, eliminating any redundant knowledge from the model.

- [loss, val_loss]()
+ [loss, val_loss](https://api.wandb.ai/links/mtasic85/x4dxkpy7)

- [val_ppl]()
+ [val_ppl](https://api.wandb.ai/links/mtasic85/hr03vzo2)

- [epoch]()
+ [epoch](https://api.wandb.ai/links/mtasic85/e6ai0066)

- [learning_rate]()
+ [learning_rate](https://api.wandb.ai/links/mtasic85/lap8xl2w)

- ## lm-evaluation-harness
+ ## Pretrain Evaluation
+
+ ### lm-evaluation-harness
 ```bash
 litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
 ```

+ | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
+ |---------------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+ |arc_challenge | 1|none | 0|acc |↑ |0.1852|± |0.0114|
+ | | |none | 0|acc_norm |↑ |0.2201|± |0.0121|
+ |gsm8k | 3|flexible-extract| 5|exact_match|↑ |0.0205|± |0.0039|
+ | | |strict-match | 5|exact_match|↑ |0.0000|± |0.0000|
+ |hellaswag | 1|none | 0|acc |↑ |0.2628|± |0.0044|
+ | | |none | 0|acc_norm |↑ |0.2705|± |0.0044|
+ |mmlu | 2|none | |acc |↑ |0.2468|± |0.0036|
+ | - humanities | 2|none | |acc |↑ |0.2459|± |0.0063|
+ | - formal_logic | 1|none | 0|acc |↑ |0.3175|± |0.0416|
+ | - high_school_european_history | 1|none | 0|acc |↑ |0.2364|± |0.0332|
+ | - high_school_us_history | 1|none | 0|acc |↑ |0.2304|± |0.0296|
+ | - high_school_world_history | 1|none | 0|acc |↑ |0.2194|± |0.0269|
+ | - international_law | 1|none | 0|acc |↑ |0.2479|± |0.0394|
+ | - jurisprudence | 1|none | 0|acc |↑ |0.2315|± |0.0408|
+ | - logical_fallacies | 1|none | 0|acc |↑ |0.2147|± |0.0323|
+ | - moral_disputes | 1|none | 0|acc |↑ |0.2168|± |0.0222|
+ | - moral_scenarios | 1|none | 0|acc |↑ |0.2726|± |0.0149|
+ | - philosophy | 1|none | 0|acc |↑ |0.1865|± |0.0221|
+ | - prehistory | 1|none | 0|acc |↑ |0.2191|± |0.0230|
+ | - professional_law | 1|none | 0|acc |↑ |0.2490|± |0.0110|
+ | - world_religions | 1|none | 0|acc |↑ |0.3450|± |0.0365|
+ | - other | 2|none | |acc |↑ |0.2385|± |0.0076|
+ | - business_ethics | 1|none | 0|acc |↑ |0.2200|± |0.0416|
+ | - clinical_knowledge | 1|none | 0|acc |↑ |0.2264|± |0.0258|
+ | - college_medicine | 1|none | 0|acc |↑ |0.2601|± |0.0335|
+ | - global_facts | 1|none | 0|acc |↑ |0.1900|± |0.0394|
+ | - human_aging | 1|none | 0|acc |↑ |0.2422|± |0.0288|
+ | - management | 1|none | 0|acc |↑ |0.2330|± |0.0419|
+ | - marketing | 1|none | 0|acc |↑ |0.2821|± |0.0295|
+ | - medical_genetics | 1|none | 0|acc |↑ |0.2900|± |0.0456|
+ | - miscellaneous | 1|none | 0|acc |↑ |0.2388|± |0.0152|
+ | - nutrition | 1|none | 0|acc |↑ |0.1993|± |0.0229|
+ | - professional_accounting | 1|none | 0|acc |↑ |0.2270|± |0.0250|
+ | - professional_medicine | 1|none | 0|acc |↑ |0.2610|± |0.0267|
+ | - virology | 1|none | 0|acc |↑ |0.2349|± |0.0330|
+ | - social sciences | 2|none | |acc |↑ |0.2632|± |0.0079|
+ | - econometrics | 1|none | 0|acc |↑ |0.2544|± |0.0410|
+ | - high_school_geography | 1|none | 0|acc |↑ |0.1869|± |0.0278|
+ | - high_school_government_and_politics| 1|none | 0|acc |↑ |0.2850|± |0.0326|
+ | - high_school_macroeconomics | 1|none | 0|acc |↑ |0.3128|± |0.0235|
+ | - high_school_microeconomics | 1|none | 0|acc |↑ |0.2773|± |0.0291|
+ | - high_school_psychology | 1|none | 0|acc |↑ |0.2422|± |0.0184|
+ | - human_sexuality | 1|none | 0|acc |↑ |0.2595|± |0.0384|
+ | - professional_psychology | 1|none | 0|acc |↑ |0.2435|± |0.0174|
+ | - public_relations | 1|none | 0|acc |↑ |0.2273|± |0.0401|
+ | - security_studies | 1|none | 0|acc |↑ |0.3265|± |0.0300|
+ | - sociology | 1|none | 0|acc |↑ |0.2537|± |0.0308|
+ | - us_foreign_policy | 1|none | 0|acc |↑ |0.3000|± |0.0461|
+ | - stem | 2|none | |acc |↑ |0.2404|± |0.0076|
+ | - abstract_algebra | 1|none | 0|acc |↑ |0.1700|± |0.0378|
+ | - anatomy | 1|none | 0|acc |↑ |0.2074|± |0.0350|
+ | - astronomy | 1|none | 0|acc |↑ |0.2105|± |0.0332|
+ | - college_biology | 1|none | 0|acc |↑ |0.2153|± |0.0344|
+ | - college_chemistry | 1|none | 0|acc |↑ |0.2000|± |0.0402|
+ | - college_computer_science | 1|none | 0|acc |↑ |0.2300|± |0.0423|
+ | - college_mathematics | 1|none | 0|acc |↑ |0.1700|± |0.0378|
+ | - college_physics | 1|none | 0|acc |↑ |0.2647|± |0.0439|
+ | - computer_security | 1|none | 0|acc |↑ |0.2700|± |0.0446|
+ | - conceptual_physics | 1|none | 0|acc |↑ |0.2766|± |0.0292|
+ | - electrical_engineering | 1|none | 0|acc |↑ |0.2552|± |0.0363|
+ | - elementary_mathematics | 1|none | 0|acc |↑ |0.2566|± |0.0225|
+ | - high_school_biology | 1|none | 0|acc |↑ |0.2097|± |0.0232|
+ | - high_school_chemistry | 1|none | 0|acc |↑ |0.2611|± |0.0309|
+ | - high_school_computer_science | 1|none | 0|acc |↑ |0.2600|± |0.0441|
+ | - high_school_mathematics | 1|none | 0|acc |↑ |0.2111|± |0.0249|
+ | - high_school_physics | 1|none | 0|acc |↑ |0.2517|± |0.0354|
+ | - high_school_statistics | 1|none | 0|acc |↑ |0.3056|± |0.0314|
+ | - machine_learning | 1|none | 0|acc |↑ |0.2857|± |0.0429|
+ |truthfulqa_mc2 | 2|none | 0|acc |↑ |0.5010|± |0.0159|
+ |winogrande | 1|none | 0|acc |↑ |0.5130|± |0.0140|
+
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.2468|± |0.0036|
+ | - humanities | 2|none | |acc |↑ |0.2459|± |0.0063|
+ | - other | 2|none | |acc |↑ |0.2385|± |0.0076|
+ | - social sciences| 2|none | |acc |↑ |0.2632|± |0.0079|
+ | - stem | 2|none | |acc |↑ |0.2404|± |0.0076|
+
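An illustrative aside, not something this commit adds: besides printing the tables above, lm-evaluation-harness also serializes its results as JSON under the given `--out_dir`, so the same numbers can be pulled out programmatically. The file layout and metric keys assumed below (`results*.json`, `acc,none`, `acc_norm,none`, `exact_match,flexible-extract`) follow recent lm-eval versions and may differ for your run.

```bash
# Sketch only: locate a results JSON under 'evaluate-quick/' and print a few metrics.
# The glob pattern and key names are assumptions; adjust to what the run actually wrote.
python3 - <<'EOF'
import glob, json

candidates = sorted(glob.glob("evaluate-quick/**/results*.json", recursive=True))
assert candidates, "no results*.json found under evaluate-quick/"
results = json.load(open(candidates[-1]))["results"]

for task, metrics in sorted(results.items()):
    for key in ("acc,none", "acc_norm,none", "exact_match,flexible-extract"):
        if key in metrics:
            print(f"{task:40s} {key:30s} {metrics[key]:.4f}")
EOF
```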
 ```bash
 litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
 ```

+ | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+ |-----------------------------------------------------------|-------|------|-----:|-----------------------|---|-----:|---|------|
+ |leaderboard | N/A| | | | | | | |
+ | - leaderboard_bbh | N/A| | | | | | | |
+ | - leaderboard_bbh_boolean_expressions | 1|none | 3|acc_norm |↑ |0.4680|± |0.0316|
+ | - leaderboard_bbh_causal_judgement | 1|none | 3|acc_norm |↑ |0.5187|± |0.0366|
+ | - leaderboard_bbh_date_understanding | 1|none | 3|acc_norm |↑ |0.1880|± |0.0248|
+ | - leaderboard_bbh_disambiguation_qa | 1|none | 3|acc_norm |↑ |0.3440|± |0.0301|
+ | - leaderboard_bbh_formal_fallacies | 1|none | 3|acc_norm |↑ |0.4720|± |0.0316|
+ | - leaderboard_bbh_geometric_shapes | 1|none | 3|acc_norm |↑ |0.1200|± |0.0206|
+ | - leaderboard_bbh_hyperbaton | 1|none | 3|acc_norm |↑ |0.5240|± |0.0316|
+ | - leaderboard_bbh_logical_deduction_five_objects | 1|none | 3|acc_norm |↑ |0.2160|± |0.0261|
+ | - leaderboard_bbh_logical_deduction_seven_objects | 1|none | 3|acc_norm |↑ |0.1400|± |0.0220|
+ | - leaderboard_bbh_logical_deduction_three_objects | 1|none | 3|acc_norm |↑ |0.3200|± |0.0296|
+ | - leaderboard_bbh_movie_recommendation | 1|none | 3|acc_norm |↑ |0.2360|± |0.0269|
+ | - leaderboard_bbh_navigate | 1|none | 3|acc_norm |↑ |0.4200|± |0.0313|
+ | - leaderboard_bbh_object_counting | 1|none | 3|acc_norm |↑ |0.1000|± |0.0190|
+ | - leaderboard_bbh_penguins_in_a_table | 1|none | 3|acc_norm |↑ |0.1575|± |0.0303|
+ | - leaderboard_bbh_reasoning_about_colored_objects | 1|none | 3|acc_norm |↑ |0.0920|± |0.0183|
+ | - leaderboard_bbh_ruin_names | 1|none | 3|acc_norm |↑ |0.2480|± |0.0274|
+ | - leaderboard_bbh_salient_translation_error_detection | 1|none | 3|acc_norm |↑ |0.1200|± |0.0206|
+ | - leaderboard_bbh_snarks | 1|none | 3|acc_norm |↑ |0.4888|± |0.0376|
+ | - leaderboard_bbh_sports_understanding | 1|none | 3|acc_norm |↑ |0.4600|± |0.0316|
+ | - leaderboard_bbh_temporal_sequences | 1|none | 3|acc_norm |↑ |0.2440|± |0.0272|
+ | - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1|none | 3|acc_norm |↑ |0.1560|± |0.0230|
+ | - leaderboard_bbh_tracking_shuffled_objects_seven_objects| 1|none | 3|acc_norm |↑ |0.0960|± |0.0187|
+ | - leaderboard_bbh_tracking_shuffled_objects_three_objects| 1|none | 3|acc_norm |↑ |0.3800|± |0.0308|
+ | - leaderboard_bbh_web_of_lies | 1|none | 3|acc_norm |↑ |0.4720|± |0.0316|
+ | - leaderboard_gpqa | N/A| | | | | | | |
+ | - leaderboard_gpqa_diamond | 1|none | 0|acc_norm |↑ |0.1970|± |0.0283|
+ | - leaderboard_gpqa_extended | 1|none | 0|acc_norm |↑ |0.2509|± |0.0186|
+ | - leaderboard_gpqa_main | 1|none | 0|acc_norm |↑ |0.2589|± |0.0207|
+ | - leaderboard_ifeval | 3|none | 0|inst_level_loose_acc |↑ |0.2650|± | N/A|
+ | | |none | 0|inst_level_strict_acc |↑ |0.2530|± | N/A|
+ | | |none | 0|prompt_level_loose_acc |↑ |0.1590|± |0.0157|
+ | | |none | 0|prompt_level_strict_acc|↑ |0.1553|± |0.0156|
+ | - leaderboard_math_hard | N/A| | | | | | | |
+ | - leaderboard_math_algebra_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_math_counting_and_prob_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_math_geometry_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_math_intermediate_algebra_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_math_num_theory_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_math_prealgebra_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_math_precalculus_hard | 1|none | 4|exact_match |↑ |0.0000|± | 0|
+ | - leaderboard_mmlu_pro | 0.1|none | 5|acc |↑ |0.1174|± |0.0029|
+ | - leaderboard_musr | N/A| | | | | | | |
+ | - leaderboard_musr_murder_mysteries | 1|none | 0|acc_norm |↑ |0.5160|± |0.0317|
+ | - leaderboard_musr_object_placements | 1|none | 0|acc_norm |↑ |0.2695|± |0.0278|
+ | - leaderboard_musr_team_allocation | 1|none | 0|acc_norm |↑ |0.3480|± |0.0302|
+
 ```bash
- litgpt evaluate --tasks 'bbh_zeroshot,bbh_fewshot,bbh_cot_fewshot,bbh_cot_zeroshot' --out_dir 'evaluate-bigbenchhard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
+ litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
 ```

+ |Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
+ |------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+ |gsm8k | 3|flexible-extract| 5|exact_match|↑ |0.0205|± |0.0039|
+ | | |strict-match | 5|exact_match|↑ |0.0000|± |0.0000|
+ |mathqa| 1|none | 0|acc |↑ |0.2010|± |0.0073|
+ | | |none | 0|acc_norm |↑ |0.2077|± |0.0074|
+
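As in the quick run above, gsm8k appears twice here because lm-evaluation-harness scores it through two answer filters: `strict-match` expects the canonical GSM8K `#### <number>` answer format, while `flexible-extract` simply takes the last number in the generation, which is why strict-match can sit at 0.0000 while flexible-extract does not. The snippet below is only a rough illustration of that difference; the harness's actual filter regexes differ in detail.

```bash
# Rough illustration of the two gsm8k answer filters; not the harness's exact regexes.
python3 - <<'EOF'
import re

generation = "She starts with 3 apples and buys 4 more, so 3 + 4 = 7. She has 7 apples."

strict = re.search(r"#### (-?[0-9.,]+)", generation)   # canonical '#### 7' style answer
numbers = re.findall(r"-?\d[\d.,]*", generation)       # any number; keep the last one

print("strict-match:    ", strict.group(1) if strict else "no match (scored 0)")
print("flexible-extract:", numbers[-1] if numbers else "no match")
EOF
```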
 ```bash
 litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
 ```

- ```bash
- litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
- ```
+ | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
+ |---------------------------------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.2468|± |0.0036|
+ | - humanities | 2|none | |acc |↑ |0.2459|± |0.0063|
+ | - formal_logic | 1|none | 0|acc |↑ |0.3175|± |0.0416|
+ | - high_school_european_history | 1|none | 0|acc |↑ |0.2364|± |0.0332|
+ | - high_school_us_history | 1|none | 0|acc |↑ |0.2304|± |0.0296|
+ | - high_school_world_history | 1|none | 0|acc |↑ |0.2194|± |0.0269|
+ | - international_law | 1|none | 0|acc |↑ |0.2479|± |0.0394|
+ | - jurisprudence | 1|none | 0|acc |↑ |0.2315|± |0.0408|
+ | - logical_fallacies | 1|none | 0|acc |↑ |0.2147|± |0.0323|
+ | - moral_disputes | 1|none | 0|acc |↑ |0.2168|± |0.0222|
+ | - moral_scenarios | 1|none | 0|acc |↑ |0.2726|± |0.0149|
+ | - philosophy | 1|none | 0|acc |↑ |0.1865|± |0.0221|
+ | - prehistory | 1|none | 0|acc |↑ |0.2191|± |0.0230|
+ | - professional_law | 1|none | 0|acc |↑ |0.2490|± |0.0110|
+ | - world_religions | 1|none | 0|acc |↑ |0.3450|± |0.0365|
+ | - other | 2|none | |acc |↑ |0.2385|± |0.0076|
+ | - business_ethics | 1|none | 0|acc |↑ |0.2200|± |0.0416|
+ | - clinical_knowledge | 1|none | 0|acc |↑ |0.2264|± |0.0258|
+ | - college_medicine | 1|none | 0|acc |↑ |0.2601|± |0.0335|
+ | - global_facts | 1|none | 0|acc |↑ |0.1900|± |0.0394|
+ | - human_aging | 1|none | 0|acc |↑ |0.2422|± |0.0288|
+ | - management | 1|none | 0|acc |↑ |0.2330|± |0.0419|
+ | - marketing | 1|none | 0|acc |↑ |0.2821|± |0.0295|
+ | - medical_genetics | 1|none | 0|acc |↑ |0.2900|± |0.0456|
+ | - miscellaneous | 1|none | 0|acc |↑ |0.2388|± |0.0152|
+ | - nutrition | 1|none | 0|acc |↑ |0.1993|± |0.0229|
+ | - professional_accounting | 1|none | 0|acc |↑ |0.2270|± |0.0250|
+ | - professional_medicine | 1|none | 0|acc |↑ |0.2610|± |0.0267|
+ | - virology | 1|none | 0|acc |↑ |0.2349|± |0.0330|
+ | - social sciences | 2|none | |acc |↑ |0.2632|± |0.0079|
+ | - econometrics | 1|none | 0|acc |↑ |0.2544|± |0.0410|
+ | - high_school_geography | 1|none | 0|acc |↑ |0.1869|± |0.0278|
+ | - high_school_government_and_politics| 1|none | 0|acc |↑ |0.2850|± |0.0326|
+ | - high_school_macroeconomics | 1|none | 0|acc |↑ |0.3128|± |0.0235|
+ | - high_school_microeconomics | 1|none | 0|acc |↑ |0.2773|± |0.0291|
+ | - high_school_psychology | 1|none | 0|acc |↑ |0.2422|± |0.0184|
+ | - human_sexuality | 1|none | 0|acc |↑ |0.2595|± |0.0384|
+ | - professional_psychology | 1|none | 0|acc |↑ |0.2435|± |0.0174|
+ | - public_relations | 1|none | 0|acc |↑ |0.2273|± |0.0401|
+ | - security_studies | 1|none | 0|acc |↑ |0.3265|± |0.0300|
+ | - sociology | 1|none | 0|acc |↑ |0.2537|± |0.0308|
+ | - us_foreign_policy | 1|none | 0|acc |↑ |0.3000|± |0.0461|
+ | - stem | 2|none | |acc |↑ |0.2404|± |0.0076|
+ | - abstract_algebra | 1|none | 0|acc |↑ |0.1700|± |0.0378|
+ | - anatomy | 1|none | 0|acc |↑ |0.2074|± |0.0350|
+ | - astronomy | 1|none | 0|acc |↑ |0.2105|± |0.0332|
+ | - college_biology | 1|none | 0|acc |↑ |0.2153|± |0.0344|
+ | - college_chemistry | 1|none | 0|acc |↑ |0.2000|± |0.0402|
+ | - college_computer_science | 1|none | 0|acc |↑ |0.2300|± |0.0423|
+ | - college_mathematics | 1|none | 0|acc |↑ |0.1700|± |0.0378|
+ | - college_physics | 1|none | 0|acc |↑ |0.2647|± |0.0439|
+ | - computer_security | 1|none | 0|acc |↑ |0.2700|± |0.0446|
+ | - conceptual_physics | 1|none | 0|acc |↑ |0.2766|± |0.0292|
+ | - electrical_engineering | 1|none | 0|acc |↑ |0.2552|± |0.0363|
+ | - elementary_mathematics | 1|none | 0|acc |↑ |0.2566|± |0.0225|
+ | - high_school_biology | 1|none | 0|acc |↑ |0.2097|± |0.0232|
+ | - high_school_chemistry | 1|none | 0|acc |↑ |0.2611|± |0.0309|
+ | - high_school_computer_science | 1|none | 0|acc |↑ |0.2600|± |0.0441|
+ | - high_school_mathematics | 1|none | 0|acc |↑ |0.2111|± |0.0249|
+ | - high_school_physics | 1|none | 0|acc |↑ |0.2517|± |0.0354|
+ | - high_school_statistics | 1|none | 0|acc |↑ |0.3056|± |0.0314|
+ | - machine_learning | 1|none | 0|acc |↑ |0.2857|± |0.0429|
+ |mmlu_pro | 2|custom-extract| |exact_match|↑ |0.0000|± |0.0000|
+ | - biology | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - business | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - chemistry | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - computer_science | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - economics | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - engineering | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - health | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - history | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - law | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - math | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - other | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - philosophy | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - physics | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|
+ | - psychology | 1|custom-extract| 5|exact_match|↑ |0.0000|± |0.0000|

- ```bash
- litgpt evaluate --tasks 'mmlu_multilingual,mgsm' --out_dir 'evaluate-multilinguals/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
- ```
+ | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
+ |------------------|------:|--------------|------|-----------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.2468|± |0.0036|
+ | - humanities | 2|none | |acc |↑ |0.2459|± |0.0063|
+ | - other | 2|none | |acc |↑ |0.2385|± |0.0076|
+ | - social sciences| 2|none | |acc |↑ |0.2632|± |0.0079|
+ | - stem | 2|none | |acc |↑ |0.2404|± |0.0076|
+ |mmlu_pro | 2|custom-extract| |exact_match|↑ |0.0000|± |0.0000|
 ```bash
- litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
+ litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
 ```

+ | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
+ |-------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+ |arc_challenge | 1|none | 0|acc |↑ |0.1852|± |0.0114|
+ | | |none | 0|acc_norm |↑ |0.2201|± |0.0121|
+ |boolq | 2|none | 0|acc |↑ |0.4446|± |0.0087|
+ |gpqa_diamond_cot_n_shot | 2|flexible-extract| 0|exact_match|↑ |0.0859|± |0.0200|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_diamond_cot_zeroshot | 1|flexible-extract| 0|exact_match|↑ |0.0606|± |0.0170|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_diamond_generative_n_shot | 2|flexible-extract| 0|exact_match|↑ |0.1717|± |0.0269|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_diamond_n_shot | 2|none | 0|acc |↑ |0.2677|± |0.0315|
+ | | |none | 0|acc_norm |↑ |0.2677|± |0.0315|
+ |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ |0.1970|± |0.0283|
+ | | |none | 0|acc_norm |↑ |0.1970|± |0.0283|
+ |gpqa_extended_cot_n_shot | 2|flexible-extract| 0|exact_match|↑ |0.0971|± |0.0127|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_extended_cot_zeroshot | 1|flexible-extract| 0|exact_match|↑ |0.0696|± |0.0109|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_extended_generative_n_shot| 2|flexible-extract| 0|exact_match|↑ |0.1502|± |0.0153|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_extended_n_shot | 2|none | 0|acc |↑ |0.2399|± |0.0183|
+ | | |none | 0|acc_norm |↑ |0.2399|± |0.0183|
+ |gpqa_extended_zeroshot | 1|none | 0|acc |↑ |0.2473|± |0.0185|
+ | | |none | 0|acc_norm |↑ |0.2473|± |0.0185|
+ |gpqa_main_cot_n_shot | 2|flexible-extract| 0|exact_match|↑ |0.1116|± |0.0149|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_main_cot_zeroshot | 1|flexible-extract| 0|exact_match|↑ |0.0625|± |0.0114|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_main_generative_n_shot | 2|flexible-extract| 0|exact_match|↑ |0.1384|± |0.0163|
+ | | |strict-match | 0|exact_match|↑ |0.0000|± |0.0000|
+ |gpqa_main_n_shot | 2|none | 0|acc |↑ |0.2388|± |0.0202|
+ | | |none | 0|acc_norm |↑ |0.2388|± |0.0202|
+ |gpqa_main_zeroshot | 1|none | 0|acc |↑ |0.2500|± |0.0205|
+ | | |none | 0|acc_norm |↑ |0.2500|± |0.0205|
+ |hellaswag | 1|none | 0|acc |↑ |0.2628|± |0.0044|
+ | | |none | 0|acc_norm |↑ |0.2705|± |0.0044|
+ |openbookqa | 1|none | 0|acc |↑ |0.1360|± |0.0153|
+ | | |none | 0|acc_norm |↑ |0.2620|± |0.0197|
+ |piqa | 1|none | 0|acc |↑ |0.5550|± |0.0116|
+ | | |none | 0|acc_norm |↑ |0.5528|± |0.0116|
+ |truthfulqa_mc2 | 2|none | 0|acc |↑ |0.5010|± |0.0159|
+ |winogrande | 1|none | 0|acc |↑ |0.5130|± |0.0140|
+
 ```bash
 litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
 ```
+
+ | Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
+ |---------------|------:|------|-----:|---------------|---|--------:|---|------|
+ |qasper_bool | 1|none | 0|f1 |↑ | 0.8966|± |0.0166|
+ |qasper_freeform| 2|none | 0|f1_abstractive |↑ | 0.0597|± |0.0052|
+ |wikitext | 2|none | 0|bits_per_byte |↓ | 2.2154|± | N/A|
+ | | |none | 0|byte_perplexity|↓ | 4.6441|± | N/A|
+ | | |none | 0|word_perplexity|↓ |3683.1019|± | N/A|
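One consistency note on the wikitext row, again an editorial aside rather than part of the commit: `bits_per_byte` and `byte_perplexity` are two views of the same quantity (`bits_per_byte = log2(byte_perplexity)`), and `word_perplexity` re-expresses the same total loss per word instead of per byte, so the three values should always move together.

```bash
# Sanity check: log2 of the reported byte_perplexity should reproduce bits_per_byte.
python3 -c "import math; print(math.log2(4.6441))"   # ~2.2154, matching the table
```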