This is a distillation experiment with SmolLM2-1.7B as the teacher and SmolLM2-360M as the student model.
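The training recipe isn't documented in detail here; as a rough point of reference, below is a minimal sketch of the kind of logit-distillation objective such an experiment typically uses. The temperature `T`, mixing weight `alpha`, and the toy batch are illustrative assumptions, not the actual settings of this run.

```python
# Minimal sketch of logit distillation from SmolLM2-1.7B into SmolLM2-360M.
# Temperature-scaled KL on teacher logits mixed with the usual causal-LM
# cross-entropy: a standard recipe, not necessarily the one used here.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "HuggingFaceTB/SmolLM2-1.7B"
student_id = "HuggingFaceTB/SmolLM2-360M"

tok = AutoTokenizer.from_pretrained(student_id)  # both models share a tokenizer/vocab
teacher = AutoModelForCausalLM.from_pretrained(teacher_id).eval()
student = AutoModelForCausalLM.from_pretrained(student_id)

def distill_loss(student_logits, teacher_logits, input_ids, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-scaled teacher and student distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary next-token cross-entropy (labels shifted by one).
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, vocab),
        input_ids[:, 1:].reshape(-1),
    )
    return alpha * kd + (1.0 - alpha) * ce

batch = tok(["Knowledge distillation trains a small model on a large model's outputs."],
            return_tensors="pt")
with torch.no_grad():
    teacher_logits = teacher(**batch).logits
student_logits = student(**batch).logits
loss = distill_loss(student_logits, teacher_logits, batch["input_ids"])
loss.backward()  # an optimizer step would follow inside a real training loop
```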
The distilled model slightly improves upon the base model on the following tasks (WIP):
| Tasks | HuggingFaceTB/SmolLM2-360M | aloobun/d-SmolLM2-360M |
|---|---:|---:|
| leaderboard_bbh_causal_judgement | 0.4545 | 0.4652 |
| leaderboard_bbh_geometric_shapes | 0.1680 | 0.2040 |
| leaderboard_bbh_movie_recommendation | 0.2120 | 0.2440 |
| leaderboard_bbh_penguins_in_a_table | 0.2055 | 0.2123 |
| leaderboard_bbh_reasoning_about_colored_objects | 0.1160 | 0.1320 |
| leaderboard_bbh_ruin_names | 0.2360 | 0.2480 |
| leaderboard_bbh_salient_translation_error_detection | 0.1480 | 0.2120 |
| leaderboard_bbh_snarks | 0.5169 | 0.5281 |
| leaderboard_bbh_temporal_sequences | 0.2720 | 0.2800 |
| leaderboard_musr_murder_mysteries | 0.5040 | 0.5160 |
Overall, it didn’t work as well as I hoped; I’ll try again.
Eval results for aloobun/d-SmolLM2-360M (WIP)
GPQA

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_gpqa | N/A | | | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2071 | ± | 0.0289 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2308 | ± | 0.0180 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2679 | ± | 0.0209 |
MUSR

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_musr | N/A | | | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5160 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.2383 | ± | 0.0267 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.4400 | ± | 0.0315 |
BBH

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_bbh | N/A | | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.5480 | ± | 0.0315 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.4652 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.1560 | ± | 0.0230 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.3120 | ± | 0.0294 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.5240 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.2040 | ± | 0.0255 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.5000 | ± | 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2240 | ± | 0.0264 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1440 | ± | 0.0222 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3320 | ± | 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.2440 | ± | 0.0272 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.5800 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.2080 | ± | 0.0257 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.2123 | ± | 0.0340 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.1320 | ± | 0.0215 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.2120 | ± | 0.0259 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.5281 | ± | 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2800 | ± | 0.0285 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1720 | ± | 0.0239 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1440 | ± | 0.0222 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3000 | ± | 0.0290 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.5480 | ± | 0.0315 |
MMLU_PRO

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.1173 | ± | 0.0029 |
IFEVAL

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.2866 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.2770 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.1497 | ± | 0.0154 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.1423 | ± | 0.0150 |
MATH HARD

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_math_hard | N/A | | | | | | | |
| - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0033 | ± | 0.0033 |
| - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.0081 | ± | 0.0081 |
| - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.0065 | ± | 0.0065 |
| - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0104 | ± | 0.0073 |
| - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
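All of the leaderboard_* groups above come from lm-evaluation-harness (the Open LLM Leaderboard v2 task set). A rough sketch of re-running them is below; the harness version, dtype, and batch size are assumptions rather than the exact settings used for these numbers.

```python
# Sketch: re-running the Open LLM Leaderboard v2 task groups with lm-evaluation-harness.
# dtype/batch_size are assumptions; the original evaluation settings aren't recorded here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=aloobun/d-SmolLM2-360M,dtype=bfloat16",
    tasks=[
        "leaderboard_gpqa",
        "leaderboard_musr",
        "leaderboard_bbh",
        "leaderboard_mmlu_pro",
        "leaderboard_ifeval",
        "leaderboard_math_hard",
    ],
    batch_size="auto",
)

# Print per-task metrics (acc_norm, exact_match, stderr, ...) for comparison
# against the tables above.
for task, metrics in sorted(results["results"].items()):
    print(task, metrics)
```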