xuanricheng committed
Commit 175efb2
1 Parent(s): f97f2b7
update about
- README.md +1 -1
- src/display/about.py +23 -111
- src/display/utils.py +6 -6
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-title: Chinese
+title: Open Chinese LLM Leaderboard
 emoji: 🏆
 colorFrom: green
 colorTo: indigo
src/display/about.py
CHANGED
@@ -1,29 +1,37 @@
|
|
1 |
from src.display.utils import ModelType
|
2 |
|
3 |
-
TITLE = """<h1 align="center" id="space-title"
|
4 |
|
5 |
INTRODUCTION_TEXT = """
|
6 |
-
|
7 |
-
|
|
|
|
|
|
|
|
|
|
|
8 |
|
9 |
-
๐ค Submit a model for automated evaluation on the ๐ค GPU cluster on the "Submit" page!
|
10 |
-
The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details in the "About" page!
|
11 |
"""
|
12 |
|
13 |
LLM_BENCHMARKS_TEXT = f"""
|
14 |
# Context
|
15 |
-
|
|
|
|
|
|
|
|
|
16 |
|
17 |
## How it works
|
18 |
|
19 |
๐ We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
|
20 |
|
21 |
-
- <a href="https://arxiv.org/abs/1803.05457" target="_blank">
|
22 |
- <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
|
23 |
-
- <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
|
24 |
- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.
|
25 |
- <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
|
26 |
- <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
|
|
|
|
|
27 |
|
28 |
For all these evaluations, a higher score is a better score.
|
29 |
We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
|
@@ -43,12 +51,13 @@ The total batch size we get for models which fit on one A100 node is 8 (8 GPUs *
|
|
43 |
*You can expect results to vary slightly for different batch sizes because of padding.*
|
44 |
|
45 |
The tasks and few shots parameters are:
|
46 |
-
- ARC: 25-shot, *arc-challenge* (`acc_norm`)
|
47 |
-
- HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
|
48 |
-
- TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
|
49 |
-
-
|
50 |
-
-
|
51 |
-
-
|
|
|
52 |
|
53 |
Side note on the baseline scores:
|
54 |
- for log-likelihood evaluation, we select the random baseline
|
@@ -63,14 +72,9 @@ If there is no icon, we have not uploaded the information on the model yet, feel
|
|
63 |
|
64 |
"Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
|
65 |
|
66 |
-
## Quantization
|
67 |
-
To get more information about quantization, see:
|
68 |
-
- 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
|
69 |
-
- 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)
|
70 |
|
71 |
## Useful links
|
72 |
- [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
|
73 |
-
- [Collection of best models](https://huggingface.co/collections/open-cn-llm-leaderboard/chinese-llm-leaderboard-best-models-65b0d4511dbd85fd0c3ad9cd)
|
74 |
"""
|
75 |
|
76 |
FAQ_TEXT = """
|
@@ -170,96 +174,4 @@ If everything is done, check you can launch the EleutherAIHarness on your model
|
|
170 |
|
171 |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
|
172 |
CITATION_BUTTON_TEXT = r"""
|
173 |
-
@misc{open-llm-leaderboard,
|
174 |
-
author = {Edward Beeching and Clรฉmentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
|
175 |
-
title = {Open LLM Leaderboard},
|
176 |
-
year = {2023},
|
177 |
-
publisher = {Hugging Face},
|
178 |
-
howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
|
179 |
-
}
|
180 |
-
@software{eval-harness,
|
181 |
-
author = {Gao, Leo and
|
182 |
-
Tow, Jonathan and
|
183 |
-
Biderman, Stella and
|
184 |
-
Black, Sid and
|
185 |
-
DiPofi, Anthony and
|
186 |
-
Foster, Charles and
|
187 |
-
Golding, Laurence and
|
188 |
-
Hsu, Jeffrey and
|
189 |
-
McDonell, Kyle and
|
190 |
-
Muennighoff, Niklas and
|
191 |
-
Phang, Jason and
|
192 |
-
Reynolds, Laria and
|
193 |
-
Tang, Eric and
|
194 |
-
Thite, Anish and
|
195 |
-
Wang, Ben and
|
196 |
-
Wang, Kevin and
|
197 |
-
Zou, Andy},
|
198 |
-
title = {A framework for few-shot language model evaluation},
|
199 |
-
month = sep,
|
200 |
-
year = 2021,
|
201 |
-
publisher = {Zenodo},
|
202 |
-
version = {v0.0.1},
|
203 |
-
doi = {10.5281/zenodo.5371628},
|
204 |
-
url = {https://doi.org/10.5281/zenodo.5371628}
|
205 |
-
}
|
206 |
-
@misc{clark2018think,
|
207 |
-
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
|
208 |
-
author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
|
209 |
-
year={2018},
|
210 |
-
eprint={1803.05457},
|
211 |
-
archivePrefix={arXiv},
|
212 |
-
primaryClass={cs.AI}
|
213 |
-
}
|
214 |
-
@misc{zellers2019hellaswag,
|
215 |
-
title={HellaSwag: Can a Machine Really Finish Your Sentence?},
|
216 |
-
author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
|
217 |
-
year={2019},
|
218 |
-
eprint={1905.07830},
|
219 |
-
archivePrefix={arXiv},
|
220 |
-
primaryClass={cs.CL}
|
221 |
-
}
|
222 |
-
@misc{hendrycks2021measuring,
|
223 |
-
title={Measuring Massive Multitask Language Understanding},
|
224 |
-
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
|
225 |
-
year={2021},
|
226 |
-
eprint={2009.03300},
|
227 |
-
archivePrefix={arXiv},
|
228 |
-
primaryClass={cs.CY}
|
229 |
-
}
|
230 |
-
@misc{lin2022truthfulqa,
|
231 |
-
title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
|
232 |
-
author={Stephanie Lin and Jacob Hilton and Owain Evans},
|
233 |
-
year={2022},
|
234 |
-
eprint={2109.07958},
|
235 |
-
archivePrefix={arXiv},
|
236 |
-
primaryClass={cs.CL}
|
237 |
-
}
|
238 |
-
@misc{DBLP:journals/corr/abs-1907-10641,
|
239 |
-
title={{WINOGRANDE:} An Adversarial Winograd Schema Challenge at Scale},
|
240 |
-
author={Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
|
241 |
-
year={2019},
|
242 |
-
eprint={1907.10641},
|
243 |
-
archivePrefix={arXiv},
|
244 |
-
primaryClass={cs.CL}
|
245 |
-
}
|
246 |
-
@misc{DBLP:journals/corr/abs-2110-14168,
|
247 |
-
title={Training Verifiers to Solve Math Word Problems},
|
248 |
-
author={Karl Cobbe and
|
249 |
-
Vineet Kosaraju and
|
250 |
-
Mohammad Bavarian and
|
251 |
-
Mark Chen and
|
252 |
-
Heewoo Jun and
|
253 |
-
Lukasz Kaiser and
|
254 |
-
Matthias Plappert and
|
255 |
-
Jerry Tworek and
|
256 |
-
Jacob Hilton and
|
257 |
-
Reiichiro Nakano and
|
258 |
-
Christopher Hesse and
|
259 |
-
John Schulman},
|
260 |
-
year={2021},
|
261 |
-
eprint={2110.14168},
|
262 |
-
archivePrefix={arXiv},
|
263 |
-
primaryClass={cs.CL}
|
264 |
-
}
|
265 |
"""
|
|
|
1 |
from src.display.utils import ModelType
|
2 |
|
3 |
+
TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""
|
4 |
|
5 |
INTRODUCTION_TEXT = """
|
6 |
+
Open Chinese LLM Leaderboard ๆจๅจ่ท่ธชใๆๅๅ่ฏไผฐๅผๆพๅผไธญๆๅคง่ฏญ่จๆจกๅ๏ผLLM๏ผใๆฌๆ่กๆฆ็ฑFlagEvalๅนณๅฐๆไพ็ธๅบ็ฎๅๅ่ฟ่ก็ฏๅขใ
|
7 |
+
่ฏไผฐๆฐๆฎ้ๆฏๅ
จ้จ้ฝๆฏไธญๆๆฐๆฎ้ไปฅ่ฏไผฐไธญๆ่ฝๅๅฆ้ๆฅ็่ฏฆๆ
ไฟกๆฏ๏ผ่ฏทๆฅ้
โๅ
ณไบโ้กต้ขใ
|
8 |
+
ๅฆ้ๅฏนๆจกๅ่ฟ่กๆดๅ
จ้ข็่ฏๆต๏ผๅฏไปฅ็ปๅฝFlagEvalๅนณๅฐ๏ผไฝ้ชๆดๅ ๅฎๅ็ๆจกๅ่ฏๆตๅ่ฝใ
|
9 |
+
|
10 |
+
The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the [FlagEval](https://flageval.baai.ac.cn/) platform, providing corresponding computational resources and runtime environment.
|
11 |
+
The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
|
12 |
+
For a more comprehensive evaluation of the model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) to experience more refined model evaluation functionalities
|
13 |
|
|
|
|
|
14 |
"""
|
15 |
|
16 |
LLM_BENCHMARKS_TEXT = f"""
|
17 |
# Context
|
18 |
+
Open Chinese LLM Leaderboardๆฏไธญๆๅคง่ฏญ่จๆ่กๆฆ๏ผๆไปฌๅธๆ่ฝๅคๆจๅจๆดๅ ๅผๆพ็็ๆ๏ผ่ฎฉไธญๆๅคง่ฏญ่จๆจกๅๅผๅ่
ๅไธ่ฟๆฅ๏ผไธบๆจๅจไธญๆ็ๅคง่ฏญ่จๆจกๅ่ฟๆญฅๅๅบ็ธๅบ็่ดก็ฎใ
|
19 |
+
ไธบไบๅฎ็ฐๅ
ฌๅนณๆง็็ฎๆ ๏ผๆๆๆจกๅ้ฝๅจ FlagEval ๅนณๅฐไธไฝฟ็จๆ ๅๅ GPU ๅ็ปไธ็ฏๅข่ฟ่ก่ฏไผฐ๏ผไปฅ็กฎไฟๅ
ฌๅนณๆงใ
|
20 |
+
|
21 |
+
The Open Chinese LLM Leaderboard serves as a ranking platform for major Chinese language models. We aspire to foster a more inclusive ecosystem, inviting developers of Chinese LLMs to contribute to the advancement of the field.
|
22 |
+
In pursuit of fairness, all models undergo evaluation on the FlagEval platform using standardized GPU and uniform environments to ensure impartiality.
|
23 |
|
24 |
## How it works
|
25 |
|
26 |
๐ We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
|
27 |
|
28 |
+
- <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
|
29 |
- <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
|
|
|
30 |
- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.
|
31 |
- <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
|
32 |
- <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
|
33 |
+
- <a href="https://flageval.baai.ac.cn/#/taskIntro?t=zh_qa" target="_blank"> C-SEM </a> (5-shot) - Semantic understanding is seen as a key cornerstone in the research and application of natural language processing. However, there is still a lack of publicly available benchmarks that approach from a linguistic perspective in the field of evaluating large Chinese language models.
|
34 |
+
- <a href="https://arxiv.org/abs/2306.09212" target="_blank"> CMMLU </a> (5-shot) - CMMLU is a comprehensive evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of Chinese language and culture. CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels.
|
35 |
|
36 |
For all these evaluations, a higher score is a better score.
|
37 |
We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
|
|
|
51 |
*You can expect results to vary slightly for different batch sizes because of padding.*
|
52 |
|
53 |
The tasks and few shots parameters are:
|
54 |
+
- C-ARC: 25-shot, *arc-challenge* (`acc_norm`)
|
55 |
+
- C-HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
|
56 |
+
- C-TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
|
57 |
+
- C-Winogrande: 5-shot, *winogrande* (`acc`)
|
58 |
+
- C-GSM8k: 5-shot, *gsm8k* (`acc`)
|
59 |
+
- C-SEM-V2: 5-shot, cmmlu* `acc`)
|
60 |
+
- CMMLU: 5-shot, cmmlu* `acc`)
|
61 |
|
62 |
Side note on the baseline scores:
|
63 |
- for log-likelihood evaluation, we select the random baseline
|
|
|
72 |
|
73 |
"Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
|
74 |
|
|
|
|
|
|
|
|
|
75 |
|
76 |
## Useful links
|
77 |
- [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
|
|
|
78 |
"""
|
79 |
|
80 |
FAQ_TEXT = """
|
|
|
174 |
|
175 |
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
|
176 |
CITATION_BUTTON_TEXT = r"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
177 |
"""
|
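Reviewer sketch: the few-shot list added to `LLM_BENCHMARKS_TEXT` could be generated from data instead of being typed by hand, which would avoid unbalanced markup such as the `cmmlu*` entries. The following is a minimal sketch under stated assumptions: `FewShotSetting`, `FEWSHOT_SETTINGS`, and `render_bullets` are hypothetical names that do not appear in the commit, and the values are copied from the list in the diff as written (including `cmmlu` as the task name for C-SEM-V2, which may be a copy-paste slip in the commit).

```python
from typing import Dict, NamedTuple


class FewShotSetting(NamedTuple):
    """One row of the 'tasks and few shots parameters' list (hypothetical type)."""
    task: str         # harness task name as printed in the about text
    num_fewshot: int  # number of in-context examples
    metric: str       # metric reported on the leaderboard


# Values copied from the list in the diff; nothing here is verified against the backend.
FEWSHOT_SETTINGS: Dict[str, FewShotSetting] = {
    "C-ARC": FewShotSetting("arc-challenge", 25, "acc_norm"),
    "C-HellaSwag": FewShotSetting("hellaswag", 10, "acc_norm"),
    "C-TruthfulQA": FewShotSetting("truthfulqa-mc", 0, "mc2"),
    "C-Winogrande": FewShotSetting("winogrande", 5, "acc"),
    "C-GSM8k": FewShotSetting("gsm8k", 5, "acc"),
    "C-SEM-V2": FewShotSetting("cmmlu", 5, "acc"),  # task name as written in the diff
    "CMMLU": FewShotSetting("cmmlu", 5, "acc"),
}


def render_bullets(settings: Dict[str, FewShotSetting]) -> str:
    """Render the settings as Markdown bullets with balanced emphasis and backticks."""
    return "\n".join(
        f"- {name}: {s.num_fewshot}-shot, *{s.task}* (`{s.metric}`)"
        for name, s in settings.items()
    )


if __name__ == "__main__":
    print(render_bullets(FEWSHOT_SETTINGS))
```

The rendered string could then be interpolated into the existing f-string, so the prose and any downstream configuration would share one source of truth.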
src/display/utils.py
CHANGED
@@ -14,13 +14,13 @@ class Task:
     col_name: str

 class Tasks(Enum):
-    arc = Task("
-    hellaswag = Task("
-    truthfulqa = Task("
-    winogrande = Task("
-    gsm8k = Task("
+    arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
+    hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
+    truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
+    winogrande = Task("c_winogrande", "acc", "C-Winogrande")
+    gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
     c_sem = Task("c-sem-v2", "acc", "C-SEM")
-    mmlu = Task("cmmlu", "
+    mmlu = Task("cmmlu", "acc_norm", "C-MMLU")

 # These classes are for user facing column names,
 # to avoid having to change them all around the code
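Reviewer sketch: to illustrate how the updated `Tasks` entries might be consumed, the snippet below averages one score per benchmark. It is a minimal sketch, not the leaderboard's actual code: only `col_name` is visible in the diff, so the field names `benchmark` and `metric` are assumptions, and `average_score` plus the shape of `results` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


@dataclass(frozen=True)
class Task:
    benchmark: str  # assumed field name: key of the benchmark in the results
    metric: str     # assumed field name: metric read for that benchmark
    col_name: str   # shown in the diff: user-facing column name


class Tasks(Enum):
    # Entries copied from the "+" side of the diff.
    arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
    hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
    truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
    winogrande = Task("c_winogrande", "acc", "C-Winogrande")
    gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
    c_sem = Task("c-sem-v2", "acc", "C-SEM")
    mmlu = Task("cmmlu", "acc_norm", "C-MMLU")


def average_score(results: Mapping[str, Mapping[str, float]]) -> float:
    """Mean of the selected metric over all tasks.

    `results` is a hypothetical mapping: benchmark name -> metric name -> score.
    A missing benchmark or metric raises KeyError instead of being skipped.
    """
    scores = [results[t.value.benchmark][t.value.metric] for t in Tasks]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    demo = {t.value.benchmark: {t.value.metric: 0.5} for t in Tasks}
    print([t.value.col_name for t in Tasks])
    print(average_score(demo))
```

One point worth checking in review: the about text lists `acc` for CMMLU, while this enum reads `acc_norm` for `cmmlu`. Which of the two is intended is not clear from the diff.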