Clémentine committed
Commit • 74d43dc
Parent(s): 6fa8358
edit gsm8k comment
src/index.html +2 -2
src/index.html CHANGED
@@ -317,8 +317,8 @@
   </div>
 </div>

-<p>
-<p>This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
+<p>The green dots highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. The red dots show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5.</p>
+<p>From our current dive in the outputs and behaviors of models, chat versions of base models sometimes have a considerably lower score than the original models on MATH! This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
 <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with 10K and plus of context size, and it seems discriminative enough to target long context reasoning specifically.</p>
 <p>Let’s conclude with a look at the future of Open LLM leaderboard!</p>
