osanseviero (HF staff) committed on
Commit
863f8f5
1 Parent(s): 5405ff7

Update src/index.html

Files changed (1)
  1. src/index.html +13 -13
src/index.html CHANGED
@@ -173,12 +173,12 @@
173
 
174
  <h2>Focusing on the models most relevant to the community</h2>
175
  <h3>Introducing the <em>maintainer’s highlight</em></h3>
176
- <p>Throughout the year, we’ve evaluated more than 7.5K models, and observed that not all of them were used as much by the community.</p>
177
- <p>The most used ones are usually new base pretrained models, often built by using a lot of compute and which can later be fine-tuned by the community for their own use cases (such as Meta’s Llama3 or Alibaba’s Qwen2). Some high quality chat or instruction models also find a large user community, for instance Cohere’s Command + R, and become also strong starting points for community experiments. ♥️</p>
178
- <p>However, the story can be different for other models, even when ranking on top of the leaderboard. A number of models are experimental, fascinating and impressive concatenations of more than 20 steps of fine-tuning or merging. </p>
179
- <p>However these models present some challenges as:</p>
180
  <ul>
181
- <li> When stacking so many steps, it can be easy to lose the precise model recipe and history, as some parent models can get deleted, fine-tuning information of a prior step can disappear, etc. </li>
182
  <li>Models can then become accidentally contaminated 😓
183
  </br>This happened several times last year, with models derived from parent models fine-tuned on instruction datasets containing information from TruthfulQA or GSM8K.
184
  </li>
@@ -186,19 +186,19 @@
186
  </br> This can happen if you select models to merge based on their high performance on the same benchmarks - it seems to improve performance selectively on said benchmarks, without actually correlating with quality in real life situations. (More research is likely needed on this).
187
  </li>
188
  </ul>
189
- <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
190
- <p>In this list, you’ll find LLMs from model creators with access to a lot of compute power such as Meta,Google, Cohere or Mistral, as well as well known collectives, like EleutherAI or NousResearch, and power users of the Hugging Face hub, among others.</p>
191
- <p>We plan to make this list evolutive based on community suggestions and our own observations, and will aim to include as much as possible SOTA LLMs as they come out and keep evaluating these models in priority.</p>
192
- <p>We hope it will also make it easier for non ML users to orient themselves among the many, many models we’ll rank on the leaderboard.</p>
193
 
194
  <h3>Voting on model relevance</h3>
195
- <p>For the previous version of the Open LLM Leaderboard, evaluations were usually run in a “first submitted, first evaluated” manner. With users sometimes submitting many LLMs variants at once and the Open LLM Leaderboard running on the limited compute of the spare cycles on the Hugging Face science cluster, we’ve decided to introduce a voting system for submitted models. The community will be able to vote for models and we will prioritize running models with the most votes first, hopefully surfacing the most awaited models on the top of the priority stack. If a model gets an extremely high number of votes when the cluster is full, we could even consider running it manually in place of other internal jobs at Hugging Face.</p>
196
- <p>To avoid spamming the vote system, users will need to be connected to their Hugging Face account to vote, and we will save the votes. We hope this system will help us prioritize models that the community is enthusiastic about.</p>
197
  <p>Finally, we’ve been hard at work on improving and simplifying the leaderboard interface itself.</p>
198
 
199
  <h3>Better and simpler interface</h3>
200
- <p>If you’re among our regular users, you may have noticed in the last month that our front end became much faster.</p>
201
- <p>This is thanks to the work of the Gradio team, notably [Freddy Boulton](https://huggingface.co/freddyaboulton), who developed a Leaderboard <code>gradio</code> component! It notably loads data client side, which makes any column selection or search virtually instantaneous! It’s also a [component](https://huggingface.co/spaces/freddyaboulton/gradio_leaderboard) that you can re-use yourself in your own leaderboard!</p>
202
  <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
203
 
204
  <h2>New leaderboard, new results!</h2>
 
173
 
174
  <h2>Focusing on the models most relevant to the community</h2>
175
  <h3>Introducing the <em>maintainer’s highlight</em></h3>
176
+ <p>Throughout the year, we’ve evaluated more than 7500 models and observed that many of them were not used as much by the community.</p>
177
+ <p>The most used ones are usually new base pretrained models, often built using a lot of compute and which the community can later fine-tune for their use cases (such as Meta’s Llama3 or Alibaba’s Qwen2). Some high-quality chat or instruction models, such as Cohere’s Command R+, also find a large user community and become strong starting points for community experiments. ♥️</p>
178
+ <p>However, the story can be different for other models, even when ranking on top of the leaderboard. Several models are experimental, fascinating, and impressive concatenations that consist of more than 20 steps of fine-tuning or merging.</p>
179
+ <p>However, these models present some challenges:</p>
180
  <ul>
181
+ <li>When stacking so many steps, it can be easy to lose the precise model recipe and history, as some parent models can get deleted, fine-tuning information of a prior step can disappear, etc. </li>
182
  <li>Models can then become accidentally contaminated 😓
183
  </br>This happened several times last year, with models derived from parent models fine-tuned on instruction datasets containing information from TruthfulQA or GSM8K.
184
  </li>
 
186
  </br> This can happen if you select models to merge based on their high performance on the same benchmarks - it seems to improve performance selectively on said benchmarks, without actually correlating with quality in real life situations. (More research is likely needed on this).
187
  </li>
188
  </ul>
189
+ <p>To highlight high-quality models in the leaderboard and prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
190
+ <p>In this list, you’ll find LLMs from model creators with access to a lot of computing power, such as Meta, Google, Cohere, or Mistral, as well as well-known collectives like EleutherAI or NousResearch, and power users of the Hugging Face Hub, among others.</p>
191
+ <p>We plan to evolve this list based on community suggestions and our observations. We will aim to include SOTA LLMs as they come out and keep evaluating these models as a priority.</p>
192
+ <p>We hope it will also make it easier for non-ML users to orient themselves among the many, many models we’ll rank on the leaderboard.</p>
193
 
194
  <h3>Voting on model relevance</h3>
195
+ <p>For the previous version of the Open LLM Leaderboard, evaluations were usually run on a “first submitted, first evaluated” basis. With users sometimes submitting many LLM variants at once and the Open LLM Leaderboard running on limited spare cycles of the Hugging Face science cluster, we’ve decided to introduce a voting system for submitted models. The community will be able to vote for models, and we will prioritize running models with the most votes first, surfacing the most awaited models at the top of the priority queue. If a model gets an extremely high number of votes when the cluster is full, we could even consider running it manually instead of other internal jobs at Hugging Face.</p>
196
+ <p>To avoid spamming the voting system, users must be connected to their Hugging Face account to vote, and we will save the votes. This system will help us prioritize models that the community is enthusiastic about.</p>
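The prioritization rule described above is simple to picture. Below is a minimal, hypothetical Python sketch of ordering pending submissions by community votes, with ties broken by submission time; the `Submission` structure and its field names are illustrative assumptions, not the leaderboard's actual backend.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative only: "Submission" and its fields are made up for this sketch.
@dataclass
class Submission:
    model_id: str           # e.g. "org/model-name" on the Hugging Face Hub
    submitted_at: datetime  # when the model entered the evaluation queue
    votes: int = 0          # votes from logged-in Hugging Face users

def prioritize(pending: list[Submission]) -> list[Submission]:
    """Most-voted models first; ties broken by earliest submission."""
    return sorted(pending, key=lambda s: (-s.votes, s.submitted_at))

# A heavily requested model jumps ahead of an earlier, less-voted submission.
pending = [
    Submission("org-a/small-variant", datetime(2024, 6, 1), votes=2),
    Submission("org-b/awaited-model", datetime(2024, 6, 3), votes=57),
]
print([s.model_id for s in prioritize(pending)])
# ['org-b/awaited-model', 'org-a/small-variant']
```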
197
  <p>Finally, we’ve been hard at work on improving and simplifying the leaderboard interface itself.</p>
198
 
199
  <h3>Better and simpler interface</h3>
200
+ <p>If you’re one of our regular users, you may have noticed that our front end has become much faster in the last month.</p>
201
+ <p>This is thanks to the work of the Gradio team, notably <a href="https://huggingface.co/freddyaboulton">Freddy Boulton</a>, who developed a Leaderboard gradio component! It notably loads data on the client side, which makes any column selection or search virtually instantaneous! It’s also a <a href="https://huggingface.co/spaces/freddyaboulton/gradio_leaderboard">component</a> that you can reuse yourself in your own leaderboard!</p>
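If you want to try the component in your own project, here is a minimal sketch of wiring `gradio_leaderboard` into a Gradio app. The sample DataFrame is made up, and the optional keyword arguments (`search_columns`, `filter_columns`) are assumptions about the component's API rather than a definitive reference; check the linked Space for the exact usage.

```python
# pip install gradio gradio_leaderboard pandas
import gradio as gr
import pandas as pd
from gradio_leaderboard import Leaderboard

# Made-up sample results standing in for real leaderboard data.
df = pd.DataFrame(
    {
        "model": ["org-a/model-x", "org-b/model-y"],
        "average_score": [71.3, 68.9],
        "precision": ["bfloat16", "float16"],
    }
)

with gr.Blocks() as demo:
    # The component ships the data to the browser, so searching and column
    # filtering happen client side. The keyword arguments below are assumed
    # names; see the component's Space for the authoritative API.
    Leaderboard(
        value=df,
        search_columns=["model"],
        filter_columns=["precision"],
    )

if __name__ == "__main__":
    demo.launch()
```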
202
  <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
203
 
204
  <h2>New leaderboard, new results!</h2>