olm/olm-gpt2-dec-2022

@@ -65,28 +65,59 @@ The model was trained according to the OLM GPT2 instructions at this [repo](http
 The model achieves the following results without any fine-tuning (zero-shot):
 | Task        | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
 |:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
-|rte          |acc         |0.5307               |0.5199                    |                             |
-|piqa         |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |            |
-|copa         |acc         |0.6400               |0.6800                    |                             |
-|record       |f1/em       |0.7094/0.7026        |0.6884/0.6818            |             |
-|boolq        |acc         |0.4872               |0.6021                |                        |
-|cb           |acc/f1      |0.4101/0.2619        |0.3393/0.1840            |/NA                          |
-|hellaswag    |acc/acc_norm|0.2892/0.3114        |0.3079/0.3482     |              |
-|mrpc         |acc/f1      |0.5662/0.6911        |0.6814/0.8099     |              |
-|multirc      |acc         |0.0189               |0.0220                    |                            |
-|lambada      |ppl/acc     |40.0554/0.3256       |28.3359/0.3699   |             |
-|wsc          |acc         |0.4327               |0.3654                   |                            |
-|wic          |acc         |0.4922               |0.5000                      |                            |
-|mnli         |acc         |0.3372               |0.3501                |                         |
-|qnli         |acc         |0.5017               |0.4946                   |                             |
-|cola         |mcc         |0.0126               |0.0000                    |                            |
-|triviaqa     |acc         |0.0151               |0.0181                |                        |
-|winogrande   |acc         |0.5162               |0.5051                   |                            |
-|webqs        |acc         |0.0030               |0.0079                |                        |
-|arc_easy     |acc/acc_norm|0.4381/0.3948        |0.4693/0.4230     |              |
-|arc_challenge|acc/acc_norm|0.1903/0.2270        |0.2090/0.2398            |                   |
 To get these results, we used the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
 which can produce results that differ from those reported in the GPT2 paper. The p-values are computed from the standard errors (stderr) reported by the evaluation harness, assuming a normal distribution.

 The model achieves the following results without any fine-tuning (zero-shot):
+|     Task     |Version|     Metric     |Value |   |Stderr|
+|--------------|------:|----------------|-----:|---|------|
+|webqs         |      0|acc_p_value     |0.0000|   |      |
+|triviaqa      |      1|acc_p_value     |0.0088|   |      |
+|arc_easy      |      0|acc_p_value     |0.0022|   |      |
+|              |       |acc_norm_p_value|0.0049|   |      |
+|arc_challenge |      0|acc_p_value     |0.1017|   |      |
+|              |       |acc_norm_p_value|0.2957|   |      |
+|copa          |      0|acc_p_value     |0.4070|   |      |
+|qnli          |      0|acc_p_value     |0.2913|   |      |
+|lambada_openai|      0|ppl_p_value     |0.0000|   |      |
+|              |       |acc_p_value     |0.0000|   |      |
+|mrpc          |      0|acc_p_value     |0.0000|   |      |
+|              |       |f1_p_value      |0.0000|   |      |
+|wsc           |      0|acc_p_value     |0.1680|   |      |
+|winogrande    |      0|acc_p_value     |0.4314|   |      |
+|hellaswag     |      0|acc_p_value     |0.0000|   |      |
+|              |       |acc_norm_p_value|0.0000|   |      |
+|rte           |      0|acc_p_value     |0.7184|   |      |
+|mnli          |      0|acc_p_value     |0.0071|   |      |
+|multirc       |      1|acc_p_value     |0.4755|   |      |
+|cb            |      1|acc_p_value     |0.2816|   |      |
+|boolq         |      1|acc_p_value     |0.0000|   |      |
+|wic           |      0|acc_p_value     |0.6924|   |      |
+|piqa          |      0|acc_p_value     |0.0004|   |      |
+|              |       |acc_norm_p_value|0.0003|   |      |
+|cola          |      0|mcc_p_value     |0.6880|   |      |
+|record        |      0|f1_p_value      |0.0000|   |      |
+|              |       |em_p_value      |0.0000|   |      |
 | Task        | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
 |:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
+|rte          |acc         |0.5307               |0.5199                    |0.7184                             |
+|piqa         |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |**0.0004**/**0.0003**              |
+|copa         |acc         |0.6400               |0.6800                    |0.4070                             |
+|record       |f1/em       |**0.7094**/**0.7026**|0.6884/0.6818             |**0.0000**/**0.0000**              |
+|boolq        |acc         |0.4872               |**0.6021**                |**0.0000**                         |
+|cb           |acc/f1      |0.4107/0.2619        |0.3393/0.1840             |0.2816/NA                          |
+|hellaswag    |acc/acc_norm|0.2892/0.3114        |**0.3079**/**0.3482**     |**0.0000**/**0.0000**              |
+|mrpc         |acc/f1      |0.5662/0.6911        |**0.6814**/**0.8099**     |**0.0000**/**0.0000**              |
+|multirc      |acc         |0.0189               |0.0220                    |0.4755                             |
+|lambada      |ppl/acc     |40.0554/0.3256       |**28.3359**/**0.3699**    |**0.0000**/**0.0000**              |
+|wsc          |acc         |0.4327               |0.3654                    |0.1680                             |
+|wic          |acc         |0.4922               |0.5000                    |0.6924                             |
+|mnli         |acc         |0.3372               |**0.3501**                |**0.0071**                         |
+|qnli         |acc         |0.5017               |0.4946                    |0.2913                             |
+|cola         |mcc         |0.0126               |0.0000                    |0.6880                             |
+|triviaqa     |acc         |0.0151               |**0.0181**                |**0.0088**                         |
+|winogrande   |acc         |0.5162               |0.5051                    |0.4314                             |
+|webqs        |acc         |0.0030               |**0.0079**                |**0.0000**                         |
+|arc_easy     |acc/acc_norm|0.4381/0.3948        |**0.4693**/**0.4230**     |**0.0022**/**0.0049**              |
+|arc_challenge|acc/acc_norm|0.1903/0.2270        |0.2090/0.2398             |0.1017/0.2957                      |
 To get these results, we used the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
 which can produce results that differ from those reported in the GPT2 paper. The p-values are computed from the standard errors (stderr) reported by the evaluation harness, assuming a normal distribution.
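
The model card does not give the exact formula behind the p-values, so the following is only a minimal sketch of one common approach: a two-sample z-test that treats the two scores as independent, normally distributed estimates. The function name `two_tailed_p_value` and the stderr values in the usage line are hypothetical. Other ways of combining the standard errors (for example, using only one model's stderr) would give different p-values.

```python
import math


def two_tailed_p_value(score_a, score_b, stderr_a, stderr_b):
    """Two-tailed p-value for the difference between two scores (z-test).

    Assumption (not confirmed by the model card): the two estimates are
    independent, so their variances add.
    """
    stderr_diff = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    if stderr_diff == 0.0:
        # No uncertainty at all: any nonzero difference is "significant".
        return 1.0 if score_a == score_b else 0.0
    z = (score_a - score_b) / stderr_diff
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical scores and stderr values, for illustration only.
p = two_tailed_p_value(0.60, 0.50, 0.03, 0.03)
print(f"two-tailed p-value: {p:.4f}")
```

The scores and their stderr values would be read from the evaluation harness output for each model and task.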