olm/olm-gpt2-dec-2022

@@ -65,59 +65,28 @@ The model was trained according to the OLM GPT2 instructions at this [repo](http
 The model achieves the following results without any fine-tuning (zero-shot):
-|     Task     |Version|     Metric     |Value |   |Stderr|
-|--------------|------:|----------------|-----:|---|------|
-|webqs         |      0|acc_p_value     |0.0000|   |      |
-|triviaqa      |      1|acc_p_value     |0.0088|   |      |
-|arc_easy      |      0|acc_p_value     |0.0022|   |      |
-|              |       |acc_norm_p_value|0.0049|   |      |
-|arc_challenge |      0|acc_p_value     |0.1017|   |      |
-|              |       |acc_norm_p_value|0.2957|   |      |
-|copa          |      0|acc_p_value     |0.4070|   |      |
-|qnli          |      0|acc_p_value     |0.2913|   |      |
-|lambada_openai|      0|ppl_p_value     |0.0000|   |      |
-|              |       |acc_p_value     |0.0000|   |      |
-|mrpc          |      0|acc_p_value     |0.0000|   |      |
-|              |       |f1_p_value      |0.0000|   |      |
-|wsc           |      0|acc_p_value     |0.1680|   |      |
-|winogrande    |      0|acc_p_value     |0.4314|   |      |
-|hellaswag     |      0|acc_p_value     |0.0000|   |      |
-|              |       |acc_norm_p_value|0.0000|   |      |
-|rte           |      0|acc_p_value     |0.7184|   |      |
-|mnli          |      0|acc_p_value     |0.0071|   |      |
-|multirc       |      1|acc_p_value     |0.4755|   |      |
-|cb            |      1|acc_p_value     |0.2816|   |      |
-|boolq         |      1|acc_p_value     |0.0000|   |      |
-|wic           |      0|acc_p_value     |0.6924|   |      |
-|piqa          |      0|acc_p_value     |0.0004|   |      |
-|              |       |acc_norm_p_value|0.0003|   |      |
-|cola          |      0|mcc_p_value     |0.6880|   |      |
-|record        |      0|f1_p_value      |0.0000|   |      |
-|              |       |em_p_value      |0.0000|   |      |
-| Task        | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
-|:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
-|rte          |acc         |0.5307               |0.5199                    |0.7184                             |
-|piqa         |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |**0.0004**/**0.0003**              |
-|copa         |acc         |0.6400               |0.6800                    |0.4070                             |
-|record       |f1/em       |**0.7094**/**0.7026**|0.6884/0.6818             |**0.0000**/**0.0000**              |
-|boolq        |acc         |0.4872               |**0.6021**                |**0.0000**                         |
-|cb           |acc/f1      |0.4107/0.2619        |0.3393/0.1840             |0.2816/NA                          |
-|hellaswag    |acc/acc_norm|0.2892/0.3114        |**0.3079**/**0.3482**     |**0.0000**/**0.0000**              |
-|mrpc         |acc/f1      |0.5662/0.6911        |**0.6814**/**0.8099**     |**0.0000**/**0.0000**              |
-|multirc      |acc         |0.0189               |0.0220                    |0.4755                             |
-|lambada      |ppl/acc     |40.0554/0.3256       |**28.3359**/**0.3699**    |**0.0000**/**0.0000**              |
-|wsc          |acc         |0.4327               |0.3654                    |0.1680                             |
-|wic          |acc         |0.4922               |0.5000                    |0.6924                             |
-|mnli         |acc         |0.3372               |**0.3501**                |**0.0071**                         |
-|qnli         |acc         |0.5017               |0.4946                    |0.2913                             |
-|cola         |mcc         |0.0126               |0.0000                    |0.6880                             |
-|triviaqa     |acc         |0.0151               |**0.0181**                |**0.0088**                         |
-|winogrande   |acc         |0.5162               |0.5051                    |0.4314                             |
-|webqs        |acc         |0.0030               |**0.0079**                |**0.0000**                         |
-|arc_easy     |acc/acc_norm|0.4381/0.3948        |**0.4693**/**0.4230**     |**0.0022**/**0.0049**              |
-|arc_challenge|acc/acc_norm|0.1903/0.2270        |0.2090/0.2398             |0.1017/0.2957                      |
 To get these results, we used the EleutherAI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
 which can produce results that differ from those reported in the GPT2 paper. The p-values are computed from the standard errors (stderr) reported by the evaluation harness, under a normal distribution assumption.

 The model achieves the following results without any fine-tuning (zero-shot):
+| Task        | Version | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
+|:------------|:--------|:-----------|--------------------:|-------------------------:|----------------------------------:|
+|rte          |0        |acc         |0.5307               |0.5199                    |0.7184                             |
+|piqa         |0        |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |**0.0004**/**0.0003**              |
+|copa         |0        |acc         |0.6400               |0.6800                    |0.4070                             |
+|record       |0        |f1/em       |**0.7094**/**0.7026**|0.6884/0.6818             |**0.0000**/**0.0000**              |
+|boolq        |1        |acc         |0.4872               |**0.6021**                |**0.0000**                         |
+|cb           |1        |acc/f1      |0.4107/0.2619        |0.3393/0.1840             |0.2816/NA                          |
+|hellaswag    |0        |acc/acc_norm|0.2892/0.3114        |**0.3079**/**0.3482**     |**0.0000**/**0.0000**              |
+|mrpc         |0        |acc/f1      |0.5662/0.6911        |**0.6814**/**0.8099**     |**0.0000**/**0.0000**              |
+|multirc      |1        |acc         |0.0189               |0.0220                    |0.4755                             |
+|lambada      |0        |ppl/acc     |40.0554/0.3256       |**28.3359**/**0.3699**    |**0.0000**/**0.0000**              |
+|wsc          |0        |acc         |0.4327               |0.3654                    |0.1680                             |
+|wic          |0        |acc         |0.4922               |0.5000                    |0.6924                             |
+|mnli         |0        |acc         |0.3372               |**0.3501**                |**0.0071**                         |
+|qnli         |0        |acc         |0.5017               |0.4946                    |0.2913                             |
+|cola         |0        |mcc         |0.0126               |0.0000                    |0.6880                             |
+|triviaqa     |1        |acc         |0.0151               |**0.0181**                |**0.0088**                         |
+|winogrande   |0        |acc         |0.5162               |0.5051                    |0.4314                             |
+|webqs        |0        |acc         |0.0030               |**0.0079**                |**0.0000**                         |
+|arc_easy     |0        |acc/acc_norm|0.4381/0.3948        |**0.4693**/**0.4230**     |**0.0022**/**0.0049**              |
+|arc_challenge|0        |acc/acc_norm|0.1903/0.2270        |0.2090/0.2398             |0.1017/0.2957                      |
 To get these results, we used the EleutherAI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
 which can produce results that differ from those reported in the GPT2 paper. The p-values are computed from the standard errors (stderr) reported by the evaluation harness, under a normal distribution assumption.
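
As a rough illustration of how a two-tailed p-value can be derived from a standard error under a normal distribution assumption, here is a minimal sketch. It is not taken from the OLM repository: the function names `two_tailed_p_value` and `combined_stderr` are hypothetical, the example inputs are made up, and the model card does not say whether a single stderr or a combination of both models' stderrs was used, so that choice is left to the caller.

```python
import math


def two_tailed_p_value(value_a: float, value_b: float, stderr: float) -> float:
    """Two-tailed p-value for the difference value_a - value_b.

    Assumes (hypothetically) that the difference is normally distributed
    with standard deviation `stderr` under the null hypothesis of no difference.
    """
    if stderr <= 0:
        raise ValueError("stderr must be positive")
    z = (value_a - value_b) / stderr
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def combined_stderr(stderr_a: float, stderr_b: float) -> float:
    """Standard error of a difference of two independent estimates.

    This is one possible way to form `stderr`; it is an assumption,
    not something the model card confirms.
    """
    return math.hypot(stderr_a, stderr_b)


if __name__ == "__main__":
    # Made-up scores and standard errors, for illustration only.
    se = combined_stderr(0.03, 0.03)
    print(f"p = {two_tailed_p_value(0.56, 0.50, se):.4f}")
```

A small p-value suggests the gap between the two scores is unlikely to be noise under these assumptions; the table bolds results where the reported p-value is low.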