olm/olm-gpt2-dec-2022

@@ -65,28 +65,30 @@ The model was trained according to the OLM GPT2 instructions at this [repo](http
 The model achieves the following results without any fine-tuning (zero-shot):
-| Task        | Version | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
-|:------------|:--------|:-----------|--------------------:|-------------------------:|----------------------------------:|
-|rte          |0        |acc         |0.5307               |0.5199                    |0.7184                             |
-|piqa         |0        |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |**0.0004**/**0.0003**              |
-|copa         |0        |acc         |0.6400               |0.6800                    |0.4070                             |
-|record       |0        |f1/em       |**0.7094**/**0.7026**|0.6884/0.6818             |**0.0000**/**0.0000**              |
-|boolq        |1        |acc         |0.4872               |**0.6021**                |**0.0000**                         |
-|cb           |1        |acc/f1      |0.4107/0.2619        |0.3393/0.1840             |0.2816/NA                          |
-|hellaswag    |0        |acc/acc_norm|0.2892/0.3114        |**0.3079**/**0.3482**     |**0.0000**/**0.0000**              |
-|mrpc         |0        |acc/f1      |0.5662/0.6911        |**0.6814**/**0.8099**     |**0.0000**/**0.0000**              |
-|multirc      |1        |acc         |0.0189               |0.0220                    |0.4755                             |
-|lambada      |0        |ppl/acc     |40.0554/0.3256       |**28.3359**/**0.3699**    |**0.0000**/**0.0000**              |
-|wsc          |0        |acc         |0.4327               |0.3654                    |0.1680                             |
-|wic          |0        |acc         |0.4922               |0.5000                    |0.6924                             |
-|mnli         |0        |acc         |0.3372               |**0.3501**                |**0.0071**                         |
-|qnli         |0        |acc         |0.5017               |0.4946                    |0.2913                             |
-|cola         |0        |mcc         |0.0126               |0.0000                    |0.6880                             |
-|triviaqa     |1        |acc         |0.0151               |**0.0181**                |**0.0088**                         |
-|winogrande   |0        |acc         |0.5162               |0.5051                    |0.4314                             |
-|webqs        |0        |acc         |0.0030               |**0.0079**                |**0.0000**                         |
-|arc_easy     |0        |acc/acc_norm|0.4381/0.3948        |**0.4693**/**0.4230**     |**0.0022**/**0.0049**              |
-|arc_challenge|0        |acc/acc_norm|0.1903/0.2270        |0.2090/0.2398             |0.1017/0.2957                      |
-To get these results, we used the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
-which can produce results different than those reported in the GPT2 paper. The p-values come from the stderr from the evaluation harness, plus a normal distribution assumption.

+| Task        | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
+|:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
+|rte          |acc         |0.5307               |0.5199                    |0.7184                             |
+|piqa         |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |**0.0004**/**0.0003**              |
+|copa         |acc         |0.6400               |0.6800                    |0.4070                             |
+|record       |f1/em       |**0.7094**/**0.7026**|0.6884/0.6818             |**0.0000**/**0.0000**              |
+|boolq        |acc         |0.4872               |**0.6021**                |**0.0000**                         |
+|cb           |acc/f1      |0.4107/0.2619        |0.3393/0.1840             |0.2816/NA                          |
+|hellaswag    |acc/acc_norm|0.2892/0.3114        |**0.3079**/**0.3482**     |**0.0000**/**0.0000**              |
+|mrpc         |acc/f1      |0.5662/0.6911        |**0.6814**/**0.8099**     |**0.0000**/**0.0000**              |
+|multirc      |acc         |0.0189               |0.0220                    |0.4755                             |
+|lambada      |ppl/acc     |40.0554/0.3256       |**28.3359**/**0.3699**    |**0.0000**/**0.0000**              |
+|wsc          |acc         |0.4327               |0.3654                    |0.1680                             |
+|wic          |acc         |0.4922               |0.5000                    |0.6924                             |
+|mnli         |acc         |0.3372               |**0.3501**                |**0.0071**                         |
+|qnli         |acc         |0.5017               |0.4946                    |0.2913                             |
+|cola         |mcc         |0.0126               |0.0000                    |0.6880                             |
+|triviaqa     |acc         |0.0151               |**0.0181**                |**0.0088**                         |
+|winogrande   |acc         |0.5162               |0.5051                    |0.4314                             |
+|webqs        |acc         |0.0030               |**0.0079**                |**0.0000**                         |
+|arc_easy     |acc/acc_norm|0.4381/0.3948        |**0.4693**/**0.4230**     |**0.0022**/**0.0049**              |
+|arc_challenge|acc/acc_norm|0.1903/0.2270        |0.2090/0.2398             |0.1017/0.2957                      |
+To get these results, we used commit `f079e322b857714fcef1ada9e78ddc606fe51e84` of the EleutherAI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
+which can produce results different from those reported in the GPT2 paper.
+We added a change [here](https://github.com/EleutherAI/lm-evaluation-harness/compare/master...mathemakitten:lm-evaluation-harness:master) to enable evaluation of the OLM GPT2, which has a very slightly different vocab size.
+The p-values are computed from the stderr values reported by the evaluation harness, under a normal distribution assumption.
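
As a rough illustration of the p-value note above, the following sketch shows one common way to turn two scores and their standard errors into a two-tailed p-value under a normal assumption. The function name `two_tailed_p_value`, the choice to combine the two stderr values in quadrature (which treats the two estimates as independent), and the example numbers are all assumptions for illustration. The diff does not state how the stderr values were combined, so this sketch is not guaranteed to reproduce the values in the table.

```python
import math


def two_tailed_p_value(score_a, score_b, stderr_a, stderr_b):
    """Two-tailed p-value for the difference between two scores.

    Assumption (not confirmed by the source): the two estimates are
    independent, so their standard errors combine in quadrature, and the
    difference is approximately normally distributed.
    """
    combined_stderr = math.hypot(stderr_a, stderr_b)
    if combined_stderr == 0:
        raise ValueError("combined standard error must be positive")
    z = (score_a - score_b) / combined_stderr
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical scores and standard errors, not taken from the table.
p = two_tailed_p_value(0.60, 0.55, 0.02, 0.02)
print(f"two-tailed p-value: {p:.4f}")
```

Using `math.erfc` keeps the sketch free of third-party dependencies; `scipy.stats.norm.sf` would give the same tail probability if SciPy is already available.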