xuanricheng committed
Commit 175efb2 • 1 Parent(s): f97f2b7

update about

Files changed (3)
  1. README.md +1 -1
  2. src/display/about.py +23 -111
  3. src/display/utils.py +6 -6
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Chinese Open LLM Leaderboard
3
  emoji: 🏆
4
  colorFrom: green
5
  colorTo: indigo
 
1
  ---
2
+ title: Open Chinese LLM Leaderboard
3
  emoji: 🏆
4
  colorFrom: green
5
  colorTo: indigo
src/display/about.py CHANGED
@@ -1,29 +1,37 @@
1
  from src.display.utils import ModelType
2
 
3
- TITLE = """<h1 align="center" id="space-title">🤗 Open Chinese LLM Leaderboard</h1>"""
4
 
5
  INTRODUCTION_TEXT = """
6
- ๐Ÿ“ The ๐Ÿค— Open Chinese LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots.
7
- This leaderboard is subset of the [FlagEval](https://flageval.baai.ac.cn/)
8
 
9
- 🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
10
- The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details in the "About" page!
11
  """
12
 
13
  LLM_BENCHMARKS_TEXT = f"""
14
  # Context
15
- With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.
16
 
17
  ## How it works
18
 
19
  📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
20
 
21
- - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
22
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
23
- - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
24
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.
25
  - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
26
  - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
 
 
27
 
28
  For all these evaluations, a higher score is a better score.
29
  We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
@@ -43,12 +51,13 @@ The total batch size we get for models which fit on one A100 node is 8 (8 GPUs *
43
  *You can expect results to vary slightly for different batch sizes because of padding.*
44
 
45
  The tasks and few shots parameters are:
46
- - ARC: 25-shot, *arc-challenge* (`acc_norm`)
47
- - HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
48
- - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
49
- - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
50
- - Winogrande: 5-shot, *winogrande* (`acc`)
51
- - GSM8k: 5-shot, *gsm8k* (`acc`)
 
52
 
53
  Side note on the baseline scores:
54
  - for log-likelihood evaluation, we select the random baseline
@@ -63,14 +72,9 @@ If there is no icon, we have not uploaded the information on the model yet, feel
63
 
64
  "Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
65
 
66
- ## Quantization
67
- To get more information about quantization, see:
68
- - 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
69
- - 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)
70
 
71
  ## Useful links
72
  - [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
73
- - [Collection of best models](https://huggingface.co/collections/open-cn-llm-leaderboard/chinese-llm-leaderboard-best-models-65b0d4511dbd85fd0c3ad9cd)
74
  """
75
 
76
  FAQ_TEXT = """
@@ -170,96 +174,4 @@ If everything is done, check you can launch the EleutherAIHarness on your model
170
 
171
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
172
  CITATION_BUTTON_TEXT = r"""
173
- @misc{open-llm-leaderboard,
174
- author = {Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
175
- title = {Open LLM Leaderboard},
176
- year = {2023},
177
- publisher = {Hugging Face},
178
- howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
179
- }
180
- @software{eval-harness,
181
- author = {Gao, Leo and
182
- Tow, Jonathan and
183
- Biderman, Stella and
184
- Black, Sid and
185
- DiPofi, Anthony and
186
- Foster, Charles and
187
- Golding, Laurence and
188
- Hsu, Jeffrey and
189
- McDonell, Kyle and
190
- Muennighoff, Niklas and
191
- Phang, Jason and
192
- Reynolds, Laria and
193
- Tang, Eric and
194
- Thite, Anish and
195
- Wang, Ben and
196
- Wang, Kevin and
197
- Zou, Andy},
198
- title = {A framework for few-shot language model evaluation},
199
- month = sep,
200
- year = 2021,
201
- publisher = {Zenodo},
202
- version = {v0.0.1},
203
- doi = {10.5281/zenodo.5371628},
204
- url = {https://doi.org/10.5281/zenodo.5371628}
205
- }
206
- @misc{clark2018think,
207
- title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
208
- author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
209
- year={2018},
210
- eprint={1803.05457},
211
- archivePrefix={arXiv},
212
- primaryClass={cs.AI}
213
- }
214
- @misc{zellers2019hellaswag,
215
- title={HellaSwag: Can a Machine Really Finish Your Sentence?},
216
- author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
217
- year={2019},
218
- eprint={1905.07830},
219
- archivePrefix={arXiv},
220
- primaryClass={cs.CL}
221
- }
222
- @misc{hendrycks2021measuring,
223
- title={Measuring Massive Multitask Language Understanding},
224
- author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
225
- year={2021},
226
- eprint={2009.03300},
227
- archivePrefix={arXiv},
228
- primaryClass={cs.CY}
229
- }
230
- @misc{lin2022truthfulqa,
231
- title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
232
- author={Stephanie Lin and Jacob Hilton and Owain Evans},
233
- year={2022},
234
- eprint={2109.07958},
235
- archivePrefix={arXiv},
236
- primaryClass={cs.CL}
237
- }
238
- @misc{DBLP:journals/corr/abs-1907-10641,
239
- title={{WINOGRANDE:} An Adversarial Winograd Schema Challenge at Scale},
240
- author={Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
241
- year={2019},
242
- eprint={1907.10641},
243
- archivePrefix={arXiv},
244
- primaryClass={cs.CL}
245
- }
246
- @misc{DBLP:journals/corr/abs-2110-14168,
247
- title={Training Verifiers to Solve Math Word Problems},
248
- author={Karl Cobbe and
249
- Vineet Kosaraju and
250
- Mohammad Bavarian and
251
- Mark Chen and
252
- Heewoo Jun and
253
- Lukasz Kaiser and
254
- Matthias Plappert and
255
- Jerry Tworek and
256
- Jacob Hilton and
257
- Reiichiro Nakano and
258
- Christopher Hesse and
259
- John Schulman},
260
- year={2021},
261
- eprint={2110.14168},
262
- archivePrefix={arXiv},
263
- primaryClass={cs.CL}
264
- }
265
  """
 
1
  from src.display.utils import ModelType
2
 
3
+ TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""
4
 
5
  INTRODUCTION_TEXT = """
6
+ Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型（LLM）。本排行榜由FlagEval平台提供相应算力和运行环境。
7
+ 评估数据集全部为中文数据集，用于评估模型的中文能力。如需查看详细信息，请查阅‘关于’页面。
8
+ 如需对模型进行更全面的评测，可以登录FlagEval平台，体验更加完善的模型评测功能。
9
+
10
+ The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). The leaderboard is powered by the [FlagEval](https://flageval.baai.ac.cn/) platform, which provides the underlying compute and runtime environment.
11
+ The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
12
+ For a more comprehensive evaluation of your model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) platform to access its more complete set of model evaluation features.
13
 
14
  """
15
 
16
  LLM_BENCHMARKS_TEXT = f"""
17
  # Context
18
+ Open Chinese LLM Leaderboard 是中文大语言模型排行榜。我们希望推动更加开放的生态，让中文大语言模型开发者参与进来，为中文大语言模型的进步做出贡献。
19
+ 为了确保公平性，所有模型都在 FlagEval 平台上使用标准化 GPU 和统一的运行环境进行评估。
20
+
21
+ The Open Chinese LLM Leaderboard is a leaderboard for open Chinese large language models. We hope to foster a more open ecosystem, inviting developers of Chinese LLMs to participate and contribute to the advancement of the field.
22
+ To ensure fairness, all models are evaluated on the FlagEval platform using standardized GPUs and a uniform environment.
23
 
24
  ## How it works
25
 
26
  📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
27
 
28
+ - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
29
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 
30
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.
31
  - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
32
  - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
33
+ - <a href="https://flageval.baai.ac.cn/#/taskIntro?t=zh_qa" target="_blank"> C-SEM </a> (5-shot) - Semantic understanding is seen as a key cornerstone in the research and application of natural language processing. However, there is still a lack of publicly available benchmarks that approach from a linguistic perspective in the field of evaluating large Chinese language models.
34
+ - <a href="https://arxiv.org/abs/2306.09212" target="_blank"> CMMLU </a> (5-shot) - CMMLU is a comprehensive evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of Chinese language and culture. CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels.
35
 
36
  For all these evaluations, a higher score is a better score.
37
  We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
 
51
  *You can expect results to vary slightly for different batch sizes because of padding.*
52
 
53
  The tasks and few shots parameters are:
54
+ - C-ARC: 25-shot, *arc-challenge* (`acc_norm`)
55
+ - C-HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
56
+ - C-TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
57
+ - C-Winogrande: 5-shot, *winogrande* (`acc`)
58
+ - C-GSM8k: 5-shot, *gsm8k* (`acc`)
59
+ - C-SEM-V2: 5-shot, *c-sem-v2* (`acc`)
60
+ - CMMLU: 5-shot, *cmmlu* (`acc`)
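
To make these settings concrete, here is a minimal sketch of how one such evaluation could be launched through the harness's Python API. It assumes a recent lm-evaluation-harness release that exposes `evaluator.simple_evaluate`, uses a placeholder model id, and uses the upstream `arc_challenge` task name; the exact task identifiers and runner used by the FlagEval backend are not shown in this commit and may differ.

```python
# Minimal sketch (not the leaderboard's actual pipeline): run one of the
# benchmarks above with the EleutherAI lm-evaluation-harness Python API.
# Assumes a recent lm-eval release; model id and batch size are placeholders.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                   # Hugging Face causal-LM backend
    model_args="pretrained=your-org/your-model",  # hypothetical model id
    tasks=["arc_challenge"],                      # 25-shot ARC, as listed above
    num_fewshot=25,
    batch_size=8,
)

# Per-task metrics (e.g. `acc_norm` for ARC) are reported under results["results"].
print(results["results"]["arc_challenge"])
```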
61
 
62
  Side note on the baseline scores:
63
  - for log-likelihood evaluation, we select the random baseline
 
72
 
73
  "Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
74
 
75
 
76
  ## Useful links
77
  - [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
 
78
  """
79
 
80
  FAQ_TEXT = """
 
174
 
175
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
176
  CITATION_BUTTON_TEXT = r"""
177
  """
src/display/utils.py CHANGED
@@ -14,13 +14,13 @@ class Task:
14
  col_name: str
15
 
16
  class Tasks(Enum):
17
- arc = Task("arc:challenge", "acc_norm", "C-ARC")
18
- hellaswag = Task("hellaswag", "acc_norm", "C-HellaSwag")
19
- truthfulqa = Task("truthfulqa:mc", "mc2", "C-TruthfulQA")
20
- winogrande = Task("winogrande", "acc", "C-Winogrande")
21
- gsm8k = Task("gsm8k", "acc", "C-GSM8K")
22
  c_sem = Task("c-sem-v2", "acc", "C-SEM")
23
- mmlu = Task("cmmlu", "acc", "C-MMLU")
24
 
25
  # These classes are for user facing column names,
26
  # to avoid having to change them all around the code
 
14
  col_name: str
15
 
16
  class Tasks(Enum):
17
+ arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
18
+ hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
19
+ truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
20
+ winogrande = Task("c_winogrande", "acc", "C-Winogrande")
21
+ gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
22
  c_sem = Task("c-sem-v2", "acc", "C-SEM")
23
+ mmlu = Task("cmmlu", "acc_norm", "C-MMLU")
24
 
25
  # These classes are for user facing column names,
26
  # to avoid having to change them all around the code
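
As a usage note, the sketch below shows one way the `Tasks` enum above might be consumed to map backend results onto the user-facing columns. It is illustrative only: the `benchmark` and `metric` field names of `Task` are inferred from the three-argument constructor (only `col_name` appears in this hunk), and the `to_leaderboard_row` helper is hypothetical, not code from this repository.

```python
# Illustrative sketch: map {task_name: {metric: value}} results onto the
# user-facing leaderboard columns defined by the Tasks enum.
# Field names `benchmark` and `metric` are assumptions; `col_name` is from the diff.
from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str   # task name reported by the evaluation backend (assumed field name)
    metric: str      # metric to read for that task (assumed field name)
    col_name: str    # user-facing column name on the leaderboard


class Tasks(Enum):
    arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
    hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
    truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
    winogrande = Task("c_winogrande", "acc", "C-Winogrande")
    gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
    c_sem = Task("c-sem-v2", "acc", "C-SEM")
    mmlu = Task("cmmlu", "acc_norm", "C-MMLU")


def to_leaderboard_row(raw_results: dict) -> dict:
    """Hypothetical helper: pick each task's metric and key it by column name."""
    return {t.value.col_name: raw_results[t.value.benchmark][t.value.metric]
            for t in Tasks}


# Example with made-up scores:
fake = {t.value.benchmark: {t.value.metric: 0.5} for t in Tasks}
print(to_leaderboard_row(fake))
```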