Clémentine committed on
Commit
b7060d2
1 Parent(s): bb264aa

change gradio embed to web component instead of iframe

Files changed (2)
  1. dist/index.html +6 -2
  2. src/index.html +6 -2
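
The change swaps the raw <iframe> embeds (whose old src URLs used a slash, leaving "open-llm-leaderboard" alone as the host) for Gradio's <gradio-app> web component, which takes the flattened "owner-space_name.hf.space" address and needs the gradio.js module loaded once on the page. Below is a minimal sketch of the resulting pattern, reusing the Space URL and gradio.js version that appear in the diff; the surrounding page skeleton is illustrative only.

<!DOCTYPE html>
<html>
  <head>
    <!-- Load the Gradio web component once per page;
         the 4.36.0 pin matches the script tag added in this commit. -->
    <script
      type="module"
      src="https://gradio.s3-us-west-2.amazonaws.com/4.36.0/gradio.js"
    ></script>
  </head>
  <body>
    <!-- Embed the Space as a custom element rather than an <iframe>;
         the src is the one used in the diff below. -->
    <gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>
  </body>
</html>

Once gradio.js has registered the custom element, each <gradio-app> tag renders the corresponding Space inline on the page.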
dist/index.html CHANGED
@@ -115,7 +115,7 @@
  <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a fairly interesting dataset, which tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions, rather than the actual contents generated, which allows the use of strict and rigorous metrics.</p>
  <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, measured as language models not originally outperforming human baselines, 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, svg for geometric shapes, etc), language understanding (sarcasm detection, name disambiguation, etc), and some world knowledge. Performance on BBH has been on average very well correlated with human preference. We expect this dataset to provide interesting insights on specific capabilities which could interest people.</p>

- <iframe src="https://open-llm-leaderboard/sample_viewer.hf.space"></iframe>
+ <gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>

  <h3>Why did we choose these subsets?</h3>
  <p>In summary, our criterion were: </p>
@@ -173,7 +173,7 @@
  <p>Features side, we added in the harness support for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
  <p>On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly, and fix the problems we observed with inconsistent few shot samples, too restrictive end of sentence tokens, etc. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results stay unchanging through time for the leaderboard tasks.</p>

- <iframe src="https://open-llm-leaderboard/GenerationVisualizer.hf.space"></iframe>
+ <gradio-app src="https://open-llm-leaderboard-GenerationVisualizer.hf.space"></gradio-app>

  <p>You can explore the visualiser we used here!</p>

@@ -461,5 +461,9 @@
  <script>
  includeHTML();
  </script>
+ <script
+ type="module"
+ src="https://gradio.s3-us-west-2.amazonaws.com/4.36.0/gradio.js"
+ ></script>
  </body>
  </html>
src/index.html CHANGED
@@ -115,7 +115,7 @@
  <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a fairly interesting dataset, which tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions, rather than the actual contents generated, which allows the use of strict and rigorous metrics.</p>
  <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, measured as language models not originally outperforming human baselines, 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, svg for geometric shapes, etc), language understanding (sarcasm detection, name disambiguation, etc), and some world knowledge. Performance on BBH has been on average very well correlated with human preference. We expect this dataset to provide interesting insights on specific capabilities which could interest people.</p>

- <iframe src="https://open-llm-leaderboard/sample_viewer.hf.space"></iframe>
+ <gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>

  <h3>Why did we choose these subsets?</h3>
  <p>In summary, our criterion were: </p>
@@ -173,7 +173,7 @@
  <p>Features side, we added in the harness support for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
  <p>On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly, and fix the problems we observed with inconsistent few shot samples, too restrictive end of sentence tokens, etc. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results stay unchanging through time for the leaderboard tasks.</p>

- <iframe src="https://open-llm-leaderboard/GenerationVisualizer.hf.space"></iframe>
+ <gradio-app src="https://open-llm-leaderboard-GenerationVisualizer.hf.space"></gradio-app>

  <p>You can explore the visualiser we used here!</p>

@@ -461,5 +461,9 @@
  <script>
  includeHTML();
  </script>
+ <script
+ type="module"
+ src="https://gradio.s3-us-west-2.amazonaws.com/4.36.0/gradio.js"
+ ></script>
  </body>
  </html>