Commit 9460c12 by pom
Parent(s): f34fb5c
update readme
README.md CHANGED
@@ -27,20 +27,22 @@ inference: false
 
 ## Evaluation Results
 
-To verify the model's capabilities, we selected several comprehensive benchmark suites spanning multiple academic disciplines, including [MMLU](https://arxiv.org/abs/2009.03300) (English), [C-Eval](https://cevalbenchmark.com/) (Chinese), [AGIEval](https://arxiv.org/abs/2304.06364) (Chinese and English), [GAOKAO-Bench](https://github.com/OpenLMLab/GAOKAO-Bench) (Chinese and English), and [GAOKAO-English](https://github.com/ExpressAI/AI-Gaokao)
+To verify the model's capabilities, we selected several comprehensive benchmark suites spanning multiple academic disciplines, including [MMLU](https://arxiv.org/abs/2009.03300) (English), [C-Eval](https://cevalbenchmark.com/) (Chinese), [AGIEval](https://arxiv.org/abs/2304.06364) (Chinese and English), [GAOKAO-Bench](https://github.com/OpenLMLab/GAOKAO-Bench) (Chinese and English), and [GAOKAO-English](https://github.com/ExpressAI/AI-Gaokao) (English). Results are shown below, with bold marking the highest score in each column:
 
 |       Model        |    Type    |       MMLU       |      C-Eval      | AGIEval<sup>1</sup> | GAOKAO-Bench<sup>1</sup> | GAOKAO-English<sup>1</sup> |
 | :----------------: | :--------: | :--------------: | :--------------: | :-----------------: | :----------------------: | :------------------------: |
 |    Baichuan-7B     | pretrained | 42.3<sup>2</sup> | 42.8<sup>2</sup> |  34.4<sup>2</sup>   |     36.3<sup>2</sup>     |            44.3            |
 | Baichuan2-7B-Base  | pretrained | 54.2<sup>2</sup> | 54.0<sup>2</sup> |  42.7<sup>2</sup>   |     47.5<sup>2</sup>     |            53.1            |
+| Baichuan2-7B-Chat  | fine-tuned |       53.2       |       52.2       |        41.3         |           49.7           |            66.6            |
 |    ChatGLM2-6B     | fine-tuned | 45.5<sup>2</sup> | 50.1<sup>2</sup> |        42.6         |           54.2           |            59.7            |
 |     Falcon-7B      | pretrained | 27.8<sup>2</sup> |       25.8       |        26.2         |           26.3           |            29.9            |
 |    InternLM-7B     | pretrained | 51.0<sup>2</sup> |       52.4       |        34.1         |           53.6           |            32.3            |
+|  InternLM-7B-Chat  | fine-tuned | 50.8<sup>2</sup> |       52.8       |        39.0         |         **67.4**         |            43.9            |
 |      Llama-7B      | pretrained | 35.1<sup>2</sup> |       27.0       |        27.4         |           26.0           |            30.1            |
 |     Llama-2-7B     | pretrained | 45.3<sup>2</sup> |       28.9       |        27.0         |           27.8           |            47.8            |
 |       MPT-7B       | pretrained | 29.6<sup>2</sup> |       27.8       |        24.2         |           25.3           |            28.1            |
 |   Vicuna-7B-v1.5   | fine-tuned | 49.8<sup>2</sup> |       22.9       |        26.7         |           24.4           |            61.1            |
-|   **XVERSE-7B**    | pretrained |
+|   **XVERSE-7B**    | pretrained |     **56.6**     |     **57.1**     |      **46.9**       |           61.7           |          **71.1**          |
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>
 > <sup>2: Results as officially reported by each model.</sup>
@@ -55,14 +57,16 @@ In order to validate the various abilities of the model, we have chosen several
 | :----------------: | :--------: | :--------------: | :--------------: | :-----------------: | :----------------------: | :------------------------: |
 |    Baichuan-7B     | pretrained | 42.3<sup>2</sup> | 42.8<sup>2</sup> |  34.4<sup>2</sup>   |     36.3<sup>2</sup>     |            44.3            |
 | Baichuan2-7B-Base  | pretrained | 54.2<sup>2</sup> | 54.0<sup>2</sup> |  42.7<sup>2</sup>   |     47.5<sup>2</sup>     |            53.1            |
+| Baichuan2-7B-Chat  | fine-tuned |       53.2       |       52.2       |        41.3         |           49.7           |            66.6            |
 |    ChatGLM2-6B     | fine-tuned | 45.5<sup>2</sup> | 50.1<sup>2</sup> |        42.6         |           54.2           |            59.7            |
 |     Falcon-7B      | pretrained | 27.8<sup>2</sup> |       25.8       |        26.2         |           26.3           |            29.9            |
 |    InternLM-7B     | pretrained | 51.0<sup>2</sup> |       52.4       |        34.1         |           53.6           |            32.3            |
+|  InternLM-7B-Chat  | fine-tuned | 50.8<sup>2</sup> |       52.8       |        39.0         |         **67.4**         |            43.9            |
 |      Llama-7B      | pretrained | 35.1<sup>2</sup> |       27.0       |        27.4         |           26.0           |            30.1            |
 |     Llama-2-7B     | pretrained | 45.3<sup>2</sup> |       28.9       |        27.0         |           27.8           |            47.8            |
 |       MPT-7B       | pretrained | 29.6<sup>2</sup> |       27.8       |        24.2         |           25.3           |            28.1            |
 |   Vicuna-7B-v1.5   | fine-tuned | 49.8<sup>2</sup> |       22.9       |        26.7         |           24.4           |            61.1            |
-|   **XVERSE-7B**    | pretrained |
+|   **XVERSE-7B**    | pretrained |     **56.6**     |     **57.1**     |      **46.9**       |           61.7           |          **71.1**          |
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>
 > <sup>2: Results as officially reported by each model.</sup>
@@ -76,6 +80,7 @@ MMLU Category Results
 |       Models       |    Type    | Average  |   STEM   | Social Science | Humanities |  Others  |
 | :----------------: | :--------: | :------: | :------: | :------------: | :--------: | :------: |
 |    Baichuan-7B     | pretrained |   42.3   |   35.6   |      48.9      |    38.4    |   48.1   |
+| Baichuan2-7B-Chat  | fine-tuned |   53.2   |   43.1   |      59.1      |    50.0    |   59.1   |
 |    ChatGLM2-6B     | fine-tuned |   45.5   |   40.1   |      51.6      |    41.2    |   51.2   |
 |    InternLM-7B     | pretrained |   51.0   | **58.7** |      43.5      |  **52.7**  |   53.2   |
 |      LLaMA-7B      | pretrained |   35.1   |   30.5   |      38.3      |    34.0    |   38.1   |
@@ -90,9 +95,11 @@ C-Eval Category Results
 | :----------------: | :--------: | :------: | :------: | :------------: | :--------: | :------: |
 |    Baichuan-7B     | pretrained |   42.8   |   38.2   |      52.0      |    46.2    |   39.3   |
 | Baichuan2-7B-Base  | pretrained |   54.9   |   47.9   |      67.3      |    58.4    |   52.8   |
+| Baichuan2-7B-Chat  | fine-tuned |   52.2   |   44.6   |      65.0      |    55.8    |   50.9   |
 |    ChatGLM2-6B     | fine-tuned |   50.1   |   46.4   |      60.4      |    50.6    |   46.9   |
 |     Falcon-7B      | pretrained |   25.8   |   25.8   |      26.0      |    25.8    |   25.7   |
 |    InternLM-7B     | pretrained |   52.4   |   47.0   |      64.9      |    55.6    |   47.6   |
+|  InternLM-7B-Chat  | fine-tuned |   52.8   |   48.4   |      65.6      |    57.0    |   45.0   |
 |      LLaMA-7B      | pretrained |   27.0   |   26.7   |      26.7      |    28.4    |   26.2   |
 |     LLaMA2-7B      | pretrained |   28.9   |   26.8   |      34.5      |    30.0    |   26.4   |
 |       MPT-7B       | pretrained |   27.8   |   27.4   |      29.8      |    26.9    |   27.7   |
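Footnote 1 pins down the scoring protocol behind these tables: only single-answer multiple-choice items are scored, and everything else (fill-in-the-blank, open-ended, and multiple-answer items) is dropped before accuracy is computed. As a rough illustration, here is a minimal sketch of that filter-then-score step; the `Item` structure and `predict_choice` callback are hypothetical stand-ins, not the evaluation harness actually used to produce these numbers.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Item:
    question: str
    choices: List[str]   # rendered options, e.g. ["A. 4", "B. 5", ...]
    answer: str          # gold label, e.g. "A"
    kind: str            # "single_choice", "multi_choice", "fill_in_blank", "open_ended"

def single_choice_accuracy(items: List[Item],
                           predict_choice: Callable[[Item], str]) -> float:
    """Accuracy over single-answer multiple-choice items only (footnote 1):
    fill-in-the-blank, open-ended, and multiple-answer items are excluded."""
    scored = [it for it in items if it.kind == "single_choice"]
    if not scored:
        return 0.0
    correct = sum(predict_choice(it) == it.answer for it in scored)
    return 100.0 * correct / len(scored)

# Tiny demo with a trivial predictor that always answers "A".
demo = [
    Item("2 + 2 = ?", ["A. 4", "B. 5"], "A", "single_choice"),
    Item("Name any prime number.", [], "", "open_ended"),  # filtered out
]
print(single_choice_accuracy(demo, lambda it: "A"))  # -> 100.0
```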