Commit 863f8f5 (parent: 5405ff7): Update src/index.html
src/index.html changed: +13 -13
@@ -173,12 +173,12 @@
 
 <h2>Focusing on the models most relevant to the community</h2>
 <h3>Introducing the <em>maintainer’s highlight</em></h3>
-<p>Throughout the year, we’ve evaluated more than
-<p>The most used ones are usually new base pretrained models, often built
-<p>However, the story can be different for other models, even when ranking on top of the leaderboard.
-<p>However these models present some challenges
+<p>Throughout the year, we’ve evaluated more than 7500 models and observed that many of them were not used as much by the community.</p>
+<p>The most used ones are usually new base pretrained models, often built using a lot of compute, which the community can later fine-tune for their use cases (such as Meta’s Llama3 or Alibaba’s Qwen2). Some high-quality chat or instruction models find a large user community, such as Cohere’s Command R+, and become strong starting points for community experiments. ♥️</p>
+<p>However, the story can be different for other models, even when ranking on top of the leaderboard. Several models are experimental, fascinating, and impressive concatenations that consist of more than 20 steps of fine-tuning or merging.</p>
+<p>However, these models present some challenges:</p>
 <ul>
-<li>
+<li>When stacking so many steps, it can be easy to lose the precise model recipe and history, as some parent models can get deleted, fine-tuning information of a prior step can disappear, etc.</li>
 <li>Models can then become accidentally contaminated 😓
 </br>This happened several times last year, with models derived from parent models fine-tuned on instruction datasets containing information from TruthfulQA or GSM8K.
 </li>
@@ -186,19 +186,19 @@
 </br> This can happen if you select models to merge based on their high performance on the same benchmarks - it seems to improve performance selectively on said benchmarks, without actually correlating with quality in real-life situations. (More research is likely needed on this.)
 </li>
 </ul>
-<p>To highlight high
-<p>In this list, you’ll find LLMs from model creators with access to a lot of
-<p>We plan to make this list evolutive based on community suggestions and our
-<p>We hope it will also make it easier for non
+<p>To highlight high-quality models in the leaderboard and prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
+<p>In this list, you’ll find LLMs from model creators with access to a lot of computing power, such as Meta, Google, Cohere, or Mistral, as well as well-known collectives, like EleutherAI or NousResearch, and power users of the Hugging Face Hub, among others.</p>
+<p>We plan to make this list evolve based on community suggestions and our observations. We will aim to include SOTA LLMs as they come out and keep evaluating these models as a priority.</p>
+<p>We hope it will also make it easier for non-ML users to orient themselves among the many, many models we’ll rank on the leaderboard.</p>
 
 <h3>Voting on model relevance</h3>
-<p>For the previous version of the Open LLM Leaderboard, evaluations were usually run in a “first submitted, first evaluated” manner. With users sometimes submitting many
-<p>To avoid spamming the
+<p>For the previous version of the Open LLM Leaderboard, evaluations were usually run in a queue (“first submitted, first evaluated”) manner. With users sometimes submitting many LLM variants at once, and the Open LLM Leaderboard running on the limited compute of the spare cycles on the Hugging Face science cluster, we’ve decided to introduce a voting system for submitted models. The community will be able to vote for models, and we will prioritize running models with the most votes first, surfacing the most awaited models at the top of the priority queue. If a model gets an extremely high number of votes when the cluster is full, we could even consider running it manually instead of other internal jobs at Hugging Face.</p>
+<p>To avoid spamming the voting system, users must be connected to their Hugging Face account to vote, and we will save the votes. This system will help us prioritize models that the community is enthusiastic about.</p>
 <p>Finally, we’ve been hard at work on improving and simplifying the leaderboard interface itself.</p>
 
 <h3>Better and simpler interface</h3>
-<p>If you’re
-<p>This is thanks to the work of the Gradio team, notably [Freddy Boulton](https://huggingface.co/freddyaboulton), who developed a Leaderboard
+<p>If you’re one of our regular users, you may have noticed that our front end has become much faster in the last month.</p>
+<p>This is thanks to the work of the Gradio team, notably [Freddy Boulton](https://huggingface.co/freddyaboulton), who developed a Leaderboard Gradio component! It notably loads data on the client side, which makes any column selection or search virtually instantaneous! It’s also a [component](https://huggingface.co/spaces/freddyaboulton/gradio_leaderboard) that you can reuse yourself in your own leaderboard!</p>
 <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
 
 <h2>New leaderboard, new results!</h2>
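The vote-based prioritization the diff describes ("prioritize running models with the most votes first", falling back to "first submitted, first evaluated") can be sketched as a simple priority queue. This is a hypothetical illustration of the ordering rule only, not the leaderboard's actual backend; the class and model names are made up:

```python
import heapq
import itertools


class VoteQueue:
    """Toy sketch of vote-based evaluation ordering (hypothetical).

    Models with the most votes are popped first; ties fall back to
    submission order ("first submitted, first evaluated").
    """

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # monotonic tie-breaker: submission order

    def submit(self, model_id, votes=0):
        # heapq is a min-heap, so negate votes to pop the highest-voted model first
        heapq.heappush(self._heap, (-votes, next(self._order), model_id))

    def next_to_evaluate(self):
        return heapq.heappop(self._heap)[2]


q = VoteQueue()
q.submit("user/model-a", votes=3)
q.submit("user/model-b", votes=10)
q.submit("user/model-c", votes=3)
# Popping yields model-b first (most votes), then model-a before
# model-c (equal votes, earlier submission).
```

Storing a `(-votes, submission_index, model_id)` tuple makes Python's lexicographic tuple comparison encode both rules at once: vote count dominates, and the submission counter breaks ties in favor of earlier submissions.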