Clémentine committed on
Commit
74d43dc
1 Parent(s): 6fa8358

edit gsm8k comment

src/index.html CHANGED
@@ -317,8 +317,8 @@
       </div>
     </div>

-      <p>In the green box, we highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive in the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
-      <p>This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
+      <p>The green dots highlight models which previously scored 0 on GSM8K due to the evaluation limitations mentioned above, but now have very decent scores on the new MATH-Level5 benchmark. These models (mostly from 01-ai) were quite strongly penalized by the previous format. The red dots show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5.</p>
+      <p>From our dive into the outputs and behaviors of these models, chat versions of base models sometimes score considerably lower on MATH than the original base models! This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
       <p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with context sizes of 10K and above, and it seems discriminative enough to target long-context reasoning specifically.</p>
       <p>Let’s conclude with a look at the future of the Open LLM leaderboard!</p>