javier-ab-bsc committed
Commit: ba5e504
Parent(s): ad94f09
Removed math

README.md CHANGED
@@ -546,7 +546,7 @@ The dataset does not allow for external contributions.
 
 ### Gold-standard benchmarks
 
-Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2157), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2154), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2153) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2155). These benchmarks include both new and existing tasks and datasets. In the tables below, we include the results in a selection of evaluation datasets that represent model's performance across a variety of tasks within these benchmarks.
+Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2157), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2154), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2153) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2155). We also use English tasks already available on the LM Evaluation Harness. These benchmarks include both new and existing tasks and datasets. In the tables below, we include the results in a selection of evaluation datasets that represent model's performance across a variety of tasks within these benchmarks.
 
 We only use tasks that are either human generated, human translated, or with a strong human-in-the-loop (i.e., machine translation followed by professional revision or machine generation followed by human revision and annotation). This is the reason behind the variety in number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
 
@@ -574,12 +574,6 @@ All results reported below are on a 5-shot setting.
 <td>acc</td>
 <td>74.06</td>
 </tr>
-<tr>
-<td>Math</td>
-<td>mgsm_direct_es</td>
-<td>em</td>
-<td>4</td>
-</tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli_es</td>
@@ -633,12 +627,6 @@ All results reported below are on a 5-shot setting.
 <td>acc</td>
 <td>73.73</td>
 </tr>
-<tr>
-<td>Math</td>
-<td>mgsm_direct_ca</td>
-<td>em</td>
-<td>6</td>
-</tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli_ca</td>
@@ -657,7 +645,7 @@ All results reported below are on a 5-shot setting.
 <td>64.88</td>
 </tr>
 <tr>
-<td>
+<td>paws_ca</td>
 <td>acc</td>
 <td>61.5</td>
 </tr>
@@ -668,22 +656,22 @@ All results reported below are on a 5-shot setting.
 <td>69.23</td>
 </tr>
 <tr>
-<td>
+<td>arc_ca_challenge</td>
 <td>acc</td>
 <td>44.54</td>
 </tr>
 <tr>
-<td>
+<td>openbookqa_ca</td>
 <td>acc</td>
 <td>36.8</td>
 </tr>
 <tr>
-<td>
+<td>piqa_ca</td>
 <td>acc</td>
 <td>70.35</td>
 </tr>
 <tr>
-<td>
+<td>siqa_ca</td>
 <td>acc</td>
 <td>48.26</td>
 </tr>
@@ -716,12 +704,6 @@ All results reported below are on a 5-shot setting.
 <td>acc</td>
 <td>64.79</td>
 </tr>
-<tr>
-<td>Math</td>
-<td>mgsm_direct_eu</td>
-<td>em</td>
-<td>6</td>
-</tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli_eu</td>
@@ -773,12 +755,6 @@ All results reported below are on a 5-shot setting.
 <th>Result</th>
 </tr></thead>
 <tbody>
-<tr>
-<td>Math</td>
-<td>mgsm_direct_gl</td>
-<td>em</td>
-<td>48</td>
-</tr>
 <tr>
 <td rowspan="2">Paraphrasing</td>
 <td>parafrases_gl</td>
@@ -826,12 +802,6 @@ All results reported below are on a 5-shot setting.
 <td>acc</td>
 <td>79.22</td>
 </tr>
-<tr>
-<td>Math</td>
-<td>mgsm_direct_en *</td>
-<td>em</td>
-<td>8</td>
-</tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli</td>
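The README text changed above describes running these benchmarks with EleutherAI's Language Model Evaluation Harness in a 5-shot setting. As a rough illustration only (not the authors' exact setup), the sketch below uses the harness's Python API on a few of the task names that appear in the tables; the model identifier is a placeholder, and tasks such as paws_ca or arc_ca_challenge are only available once the linked SpanishBench/CatalanBench/BasqueBench/GalicianBench pull requests are part of your lm-evaluation-harness install.

```python
# Minimal sketch: evaluating a Hugging Face model on a few tasks named in the
# diff above with lm-evaluation-harness, 5-shot as reported in the model card.
# "<org>/<model_id>" is a hypothetical placeholder, not the card's model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face transformers backend
    model_args="pretrained=<org>/<model_id>",  # placeholder model identifier
    tasks=["wnli_es", "paws_ca", "arc_ca_challenge", "piqa_ca"],
    num_fewshot=5,                             # matches the 5-shot setting above
)

# Per-task metrics (e.g. "acc") are keyed by task name under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```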