Commit 511e6a5
Clémentine committed
Parent(s): 201ca87
added gsm8k, updated ranking table
Files changed:
- assets/scripts/math_vs_gsm8k.html +0 -0
- dist/assets/scripts/math_vs_gsm8k.html +0 -0
- dist/index.html +16 -31
- src/index.html +16 -31
assets/scripts/math_vs_gsm8k.html
CHANGED
The diff for this file is too large to render. See the raw diff.
dist/assets/scripts/math_vs_gsm8k.html
CHANGED
The diff for this file is too large to render. See the raw diff.
dist/index.html
CHANGED
@@ -212,77 +212,62 @@
 
 <h2>New leaderboard, new results!</h2>
 <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
-<aside>As the cluster has been extremely full,
+<aside>As the cluster has been extremely full, you can expect models to keep appearing over the next few days!</aside>
 
 <h3>What do the rankings look like?</h3>
 
-<p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version,
-<p>We’ve been particularly impressed by
-<p>
-<p>
+<p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen2-72B-Instruct, Meta’s Llama-3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
+<p>We’ve been particularly impressed by Qwen2-72B-Instruct, one step above other models (notably thanks to its performance in math, long-range reasoning, and knowledge).</p>
+<p>The current second-best model, Llama-3-70B-Instruct, interestingly loses 15 points to its pretrained counterpart on GPQA, which raises the question of whether the particularly extensive instruction fine-tuning done by the Meta team affected some expert/graduate-level knowledge.</p>
+<p>Also very interesting is the fact that a new challenger climbed the ranks to reach 3rd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4k-instruct model shows a performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
 
 <table>
 <tr>
 <th>Rank</th>
-<th>Leaderboard
-<th>Leaderboard v2</th>
+<th>New Leaderboard Ranking</th>
 </tr>
 <tr>
 <td>⭐</td>
-<td><b>
-<td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
+<td><b>Qwen/Qwen2-72B-Instruct</b></td>
 </tr>
 <tr>
 <td>2</td>
 <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
-<td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
 </tr>
 <tr>
 <td>3</td>
-<td><
-<td>01-ai/Yi-1.5-34B-Chat</td>
+<td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
 </tr>
 <tr>
 <td>4</td>
-<td>
-<td><b>abacusai/Smaug-72B-v0.1</b></td>
+<td><b>01-ai/Yi-1.5-34B-Chat</b></td>
 </tr>
 <tr>
 <td>5</td>
-<td>
-<td><b>CohereForAI/c4ai-command-r-plus<b></td>
-</tr>
+<td><b>CohereForAI/c4ai-command-r-plus</b></td>
 <tr>
 <td>6</td>
-<td><b>
-<td>Qwen/Qwen1.5-110B-Chat</td>
+<td><b>abacusai/Smaug-72B-v0.1</b></td>
 </tr>
 <tr>
 <td>7</td>
-<td
-<td>NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO</td>
+<td>Qwen/Qwen1.5-110B</td>
 </tr>
 <tr>
 <td>8</td>
-<td>
-<td><b>meta-llama/Meta-Llama-3-70B</b></td>
+<td>Qwen/Qwen1.5-110B-Chat</td>
 </tr>
 <tr>
 <td>9</td>
-<td
-<td>01-ai/Yi-1.5-9B-Chat</td>
+<td>microsoft/Phi-3-small-128k-instruct</td>
 </tr>
 <tr>
 <td>10</td>
-<td>01-ai/Yi-1.5-
-<td>01-ai/Yi-1.5-34B-32K</td>
+<td>01-ai/Yi-1.5-9B-Chat</td>
 </tr>
 </table>
-<p>We’ve been particularly impressed by Llama-70B-instruct, who is the best model across many evaluations (though it has 15 points less than it’s base counterpart on GPQA - does instruct tuning remove knowledge?).</p>
-
-<p>Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4K-instruct, only 13B parameters but a performance equivalent to models 2 to 4 times its size.</p>
 
-<p>
+<p>Here is a detailed view of the changes in rankings:</p>
 
 <div class="main-plot-container">
 <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
src/index.html
CHANGED
@@ -212,77 +212,62 @@
 
 <h2>New leaderboard, new results!</h2>
 <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
-<aside>As the cluster has been extremely full,
+<aside>As the cluster has been extremely full, you can expect models to keep appearing over the next few days!</aside>
 
 <h3>What do the rankings look like?</h3>
 
-<p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version,
-<p>We’ve been particularly impressed by
-<p>
-<p>
+<p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen2-72B-Instruct, Meta’s Llama-3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
+<p>We’ve been particularly impressed by Qwen2-72B-Instruct, one step above other models (notably thanks to its performance in math, long-range reasoning, and knowledge).</p>
+<p>The current second-best model, Llama-3-70B-Instruct, interestingly loses 15 points to its pretrained counterpart on GPQA, which raises the question of whether the particularly extensive instruction fine-tuning done by the Meta team affected some expert/graduate-level knowledge.</p>
+<p>Also very interesting is the fact that a new challenger climbed the ranks to reach 3rd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4k-instruct model shows a performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
 
 <table>
 <tr>
 <th>Rank</th>
-<th>Leaderboard
-<th>Leaderboard v2</th>
+<th>New Leaderboard Ranking</th>
 </tr>
 <tr>
 <td>⭐</td>
-<td><b>
-<td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
+<td><b>Qwen/Qwen2-72B-Instruct</b></td>
 </tr>
 <tr>
 <td>2</td>
 <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
-<td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
 </tr>
 <tr>
 <td>3</td>
-<td><
-<td>01-ai/Yi-1.5-34B-Chat</td>
+<td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
 </tr>
 <tr>
 <td>4</td>
-<td>
-<td><b>abacusai/Smaug-72B-v0.1</b></td>
+<td><b>01-ai/Yi-1.5-34B-Chat</b></td>
 </tr>
 <tr>
 <td>5</td>
-<td>
-<td><b>CohereForAI/c4ai-command-r-plus<b></td>
-</tr>
+<td><b>CohereForAI/c4ai-command-r-plus</b></td>
 <tr>
 <td>6</td>
-<td><b>
-<td>Qwen/Qwen1.5-110B-Chat</td>
+<td><b>abacusai/Smaug-72B-v0.1</b></td>
 </tr>
 <tr>
 <td>7</td>
-<td
-<td>NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO</td>
+<td>Qwen/Qwen1.5-110B</td>
 </tr>
 <tr>
 <td>8</td>
-<td>
-<td><b>meta-llama/Meta-Llama-3-70B</b></td>
+<td>Qwen/Qwen1.5-110B-Chat</td>
 </tr>
 <tr>
 <td>9</td>
-<td
-<td>01-ai/Yi-1.5-9B-Chat</td>
+<td>microsoft/Phi-3-small-128k-instruct</td>
 </tr>
 <tr>
 <td>10</td>
-<td>01-ai/Yi-1.5-
-<td>01-ai/Yi-1.5-34B-32K</td>
+<td>01-ai/Yi-1.5-9B-Chat</td>
 </tr>
 </table>
-<p>We’ve been particularly impressed by Llama-70B-instruct, who is the best model across many evaluations (though it has 15 points less than it’s base counterpart on GPQA - does instruct tuning remove knowledge?).</p>
-
-<p>Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4K-instruct, only 13B parameters but a performance equivalent to models 2 to 4 times its size.</p>
 
-<p>
+<p>Here is a detailed view of the changes in rankings:</p>
 
 <div class="main-plot-container">
 <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>