clefourrier (HF staff) and osanseviero (HF staff) committed
Commit 52b4c86
1 parent: 3685542

Minor grammar changes (#1)


- Minor grammar changes (bcf8b15097132b20cb87fcbe907cb087f52f171f)


Co-authored-by: Omar Sanseviero <[email protected]>

Files changed (1)
  1. src/index.html +14 -17
src/index.html CHANGED
@@ -55,28 +55,25 @@
   <d-contents>
   </d-contents>
 
-  <p>Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago, when they wanted to reproduce and compare results from several published models.
-  It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, sometimes doubtful but most of the case,
-  just using optimized prompts or evaluation setup to give best chances to the models. They therefore decided to create a place where reference models would be
-  evaluated in the exact same setup (same questions, asked in the same order, ), to gather completely reproducible and comparable results; and that’s how the
+  <p>Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago when they wanted to reproduce and compare results from several published models.
+  It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, sometimes doubtful, but in most cases,
+  just using optimized prompts or evaluation setup to give the best chances to the models. They therefore decided to create a place where reference models would be
+  evaluated in the exact same setup (same questions, asked in the same order, etc.) to gather completely reproducible and comparable results; and that’s how the
   Open LLM Leaderboard was born!</p>
 
-  <p> Following a series of highly-visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.</p>
+  <p> Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.</p>
 
-  <p> We estimate that around 300 000 community members use and collaborate on it monthly through submissions and discussions; usually to: </p>
+  <p> Around 300,000 community members use and collaborate on it monthly through submissions and discussions, usually to: </p>
   <ul>
-  <li> Find state-of-the-art open source releases as the leaderboardit provides reproducible scores separating marketing fluff from actual progress in the field.</li>
-  <li> Evaluate their own work, be it pretraining or finetuning, comparing methods in the open and to the best existing models, and earning public recognition for their work.</li>
+  <li> Find state-of-the-art open-source releases as the leaderboard provides reproducible scores separating marketing fluff from actual progress in the field.</li>
+  <li> Evaluate their work, be it pretraining or finetuning, comparing methods in the open and to the best existing models, and earning public recognition.</li>
   </ul>
 
-  <p> However, with success, both in the leaderboard and the increasing performances of the models came challenges and after one intense year and a lot of community feedback, we thought it was time for an upgrade! Therefore, we’re introducing the Open LLM Leaderboard v2!</p>
-
-  <p>Here is why we think a new leaderboard was needed 👇</p>
-
-
-  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
+  <p> However, with success, both in the leaderboard and the increasing performances of the models came challenges. After one intense year and a lot of community feedback, we thought it was time for an upgrade! Therefore, we’re introducing the Open LLM Leaderboard v2!</p>
 
+  <p>Here is why we think a new leaderboard is needed 👇</p>
 
+  <h2>Harder, better, faster, stronger: Introducing the LLM Leaderboard v2</h2>
 
   <h3>The need for a more challenging leaderboard</h3>
 
@@ -91,9 +88,9 @@
   </div>
 
   <ol>
-  <li>They became too easy for models. For instance on HellaSwag, MMLU and ARC, models are now reaching baseline human performance, a phenomenon called saturation.</li>
-  <li>Some newer models also showed signs of contamination. By this we mean that models were possibly trained on benchmark data or on data very similar to benchmark data. As such, some scores stopped reflecting general performances of model and started to over-fit on some evaluation dataset instead of being reflective of the more general performances of the task being tested. This was in particular the case for GSM8K and TruthfulQA which were included in some instruction fine-tuning sets.</li>
-  <li>Some benchmarks contained errors: MMLU was recently investigated in depth by several groups who surfaced mistakes in its responses and proposed new versions. Another example was the fact that GSM8K used some specific end of generation token (<code>:</code>) which unfairly pushed down performance of many verbose models.</li>
+  <li>They became too easy for models. For instance, models on HellaSwag, MMLU, and ARC are now reaching baseline human performance, a phenomenon called saturation.</li>
+  <li>Some newer models also showed signs of contamination. By this, we mean that models were possibly trained on benchmark data or on data very similar to benchmark data. As such, some scores stopped reflecting the general performance of the model and started to overfit on some evaluation datasets instead of reflecting the more general performance of the task being tested. This was, in particular, the case for GSM8K and TruthfulQA, which were included in some instruction fine-tuning sets.</li>
+  <li>Some benchmarks contained errors. MMLU was recently investigated in depth by several groups, which surfaced mistakes in its responses and proposed new versions. Another example was that GSM8K used a specific end-of-generation token (:), which unfairly pushed down the performance of many verbose models.</li>
   </ol>
 
  <p>We thus chose to completely change the evaluations we are running for the Open LLM Leaderboard v2!</p>
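
As a side note on the GSM8K stop-token issue in the last list item of the diff above: the sketch below (a hypothetical truncate_at_stop helper, not the leaderboard's actual harness code) shows how cutting generations at a ":" end-of-generation marker truncates a verbose model's output before it ever states the final number, so its answer is scored as wrong even when the reasoning is correct.

# Minimal sketch, assuming generations are post-processed by cutting at the
# first stop sequence; the names here are illustrative, not the harness's API.
def truncate_at_stop(generation: str, stop_sequences: list[str]) -> str:
    """Cut the generation at the first occurrence of any stop sequence."""
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

# A terse model is unaffected by the ':' stop token...
print(truncate_at_stop("The answer is 42", [":"]))
# -> "The answer is 42"

# ...but a verbose model is cut off before it reaches its answer.
print(truncate_at_stop("Let's think step by step: 6 * 7 = 42, so the answer is 42", [":"]))
# -> "Let's think step by step"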