Merge branch 'main' of hf.co:spaces/open-llm-leaderboard/blog
- src/index.html +2 -1
src/index.html
CHANGED
@@ -145,7 +145,7 @@
 <aside>
 <p><em>Should we have included more evaluations?</em></p>
 
-<p>We chose to focus on a limited number of evaluations to keep the computation time realistic.
+<p>We chose to focus on a limited number of evaluations to keep the computation time realistic. We wanted to include many other evaluations (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
 </aside>
 
 <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
@@ -207,6 +207,7 @@
 
 <h2>New leaderboard, new results!</h2>
 <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
+<aside>As the cluster has been extremely full, models of more than 140B parameters (such as Falcon-180B and BLOOM) will be run a bit later. </aside>
 
 <h3>What do the rankings look like?</h3>
 