Clémentine committed on
Commit
74d43dc
1 Parent(s): 6fa8358

edit gsm8k comment

src/index.html CHANGED
@@ -317,8 +317,8 @@
       </div>
     </div>

-      <p>In the green box, we highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive in the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
-      <p>This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
+      <p>The green dots highlight models which previously scored 0 on GSM8K due to the evaluation limitations mentioned above, but now have very decent scores on the new MATH-Level5 benchmark. These models (mostly from 01-ai) were quite strongly penalized by the previous format. The red dots show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5.</p>
+      <p>From our dive into the outputs and behaviors of these models, chat versions of base models sometimes score considerably lower on MATH than the original base models! This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
       <p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with context sizes of 10K and above, and it seems discriminative enough to target long-context reasoning specifically.</p>
       <p>Let’s conclude with a look at the future of the Open LLM leaderboard!</p>