Text Generation
Transformers
PyTorch
xverse
custom_code
pom committed
Commit 9460c12
1 Parent(s): f34fb5c

update readme

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -27,20 +27,22 @@ inference: false
 
 ## Evaluation Results
 
-To validate the model's abilities, we selected several comprehensive multi-discipline benchmarks, including [MMLU](https://arxiv.org/abs/2009.03300) (English), [C-Eval](https://cevalbenchmark.com/) (Chinese), [AGIEval](https://arxiv.org/abs/2304.06364) (Chinese and English), [GAOKAO-Bench](https://github.com/OpenLMLab/GAOKAO-Bench) (Chinese and English), and [GAOKAO-English](https://github.com/ExpressAI/AI-Gaokao) (English). The results are as follows:
+To validate the model's abilities, we selected several comprehensive multi-discipline benchmarks, including [MMLU](https://arxiv.org/abs/2009.03300) (English), [C-Eval](https://cevalbenchmark.com/) (Chinese), [AGIEval](https://arxiv.org/abs/2304.06364) (Chinese and English), [GAOKAO-Bench](https://github.com/OpenLMLab/GAOKAO-Bench) (Chinese and English), and [GAOKAO-English](https://github.com/ExpressAI/AI-Gaokao) (English). The results are as follows (bold marks the highest score in each column):
 
 | Model             | Type       | MMLU             | C-Eval           | AGIEval<sup>1</sup> | GAOKAO-Bench<sup>1</sup> | GAOKAO-English<sup>1</sup> |
 | :---------------: | :--------: | :--------------: | :--------------: | :-----------------: | :----------------------: | :------------------------: |
 | Baichuan-7B       | pretrained | 42.3<sup>2</sup> | 42.8<sup>2</sup> | 34.4<sup>2</sup>    | 36.3<sup>2</sup>         | 44.3                       |
 | Baichuan2-7B-Base | pretrained | 54.2<sup>2</sup> | 54.0<sup>2</sup> | 42.7<sup>2</sup>    | 47.5<sup>2</sup>         | 53.1                       |
+| Baichuan2-7B-Chat | fine-tuned | 53.2             | 52.2             | 41.3                | 49.7                     | 66.6                       |
 | ChatGLM2-6B       | fine-tuned | 45.5<sup>2</sup> | 50.1<sup>2</sup> | 42.6                | 54.2                     | 59.7                       |
 | Falcon-7B         | pretrained | 27.8<sup>2</sup> | 25.8             | 26.2                | 26.3                     | 29.9                       |
 | InternLM-7B       | pretrained | 51.0<sup>2</sup> | 52.4             | 34.1                | 53.6                     | 32.3                       |
+| InternLM-7B-Chat  | fine-tuned | 50.8<sup>2</sup> | 52.8             | 39.0                | **67.4**                 | 43.9                       |
 | Llama-7B          | pretrained | 35.1<sup>2</sup> | 27.0             | 27.4                | 26.0                     | 30.1                       |
 | Llama-2-7B        | pretrained | 45.3<sup>2</sup> | 28.9             | 27.0                | 27.8                     | 47.8                       |
 | MPT-7B            | pretrained | 29.6<sup>2</sup> | 27.8             | 24.2                | 25.3                     | 28.1                       |
 | Vicuna-7B-v1.5    | fine-tuned | 49.8<sup>2</sup> | 22.9             | 26.7                | 24.4                     | 61.1                       |
-| **XVERSE-7B**     | pretrained | 56.6             | **57.1**         | 46.9                | **61.7**                 | 71.1                       |
+| **XVERSE-7B**     | pretrained | **56.6**         | **57.1**         | **46.9**            | 61.7                     | **71.1**                   |
 
 > <sup>1: Only the single-answer multiple-choice questions are tested, i.e. fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions are excluded.</sup>
 > <sup>2: Results as officially reported by each model.</sup>
@@ -55,14 +57,16 @@ In order to validate the various abilities of the model, we have chosen several
 | :----------------: | :--------: | :--------------: | :--------------: | :-----------------: | :----------------------: | :------------------------: |
 | Baichuan-7B        | pretrained | 42.3<sup>2</sup> | 42.8<sup>2</sup> | 34.4<sup>2</sup>    | 36.3<sup>2</sup>         | 44.3                       |
 | Baichuan2-7B-Base  | pretrained | 54.2<sup>2</sup> | 54.0<sup>2</sup> | 42.7<sup>2</sup>    | 47.5<sup>2</sup>         | 53.1                       |
+| Baichuan2-7B-Chat  | fine-tuned | 53.2             | 52.2             | 41.3                | 49.7                     | 66.6                       |
 | ChatGLM2-6B        | fine-tuned | 45.5<sup>2</sup> | 50.1<sup>2</sup> | 42.6                | 54.2                     | 59.7                       |
 | Falcon-7B          | pretrained | 27.8<sup>2</sup> | 25.8             | 26.2                | 26.3                     | 29.9                       |
 | InternLM-7B        | pretrained | 51.0<sup>2</sup> | 52.4             | 34.1                | 53.6                     | 32.3                       |
+| InternLM-7B-Chat   | fine-tuned | 50.8<sup>2</sup> | 52.8             | 39.0                | **67.4**                 | 43.9                       |
 | Llama-7B           | pretrained | 35.1<sup>2</sup> | 27.0             | 27.4                | 26.0                     | 30.1                       |
 | Llama-2-7B         | pretrained | 45.3<sup>2</sup> | 28.9             | 27.0                | 27.8                     | 47.8                       |
 | MPT-7B             | pretrained | 29.6<sup>2</sup> | 27.8             | 24.2                | 25.3                     | 28.1                       |
 | Vicuna-7B-v1.5     | fine-tuned | 49.8<sup>2</sup> | 22.9             | 26.7                | 24.4                     | 61.1                       |
-| **XVERSE-7B**      | pretrained | 56.6             | **57.1**         | 46.9                | **61.7**                 | 71.1                       |
+| **XVERSE-7B**      | pretrained | **56.6**         | **57.1**         | **46.9**            | 61.7                     | **71.1**                   |
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions.</sup>
 > <sup>2: Results as officially reported by each model.</sup>
@@ -76,6 +80,7 @@ MMLU Category Results
 | Models            | Type       | Average  | STEM     | Social Science | Humanities | Others   |
 | :---------------: | :--------: | :------: | :------: | :------------: | :--------: | :------: |
 | Baichuan-7B       | pretrained | 42.3     | 35.6     | 48.9           | 38.4       | 48.1     |
+| Baichuan2-7B-Chat | fine-tuned | 53.2     | 43.1     | 59.1           | 50.0       | 59.1     |
 | ChatGLM2-6B       | pretrained | 45.5     | 40.1     | 51.6           | 41.2       | 51.2     |
 | InternLM-7B       | pretrained | 51.0     | **58.7** | 43.5           | **52.7**   | 53.2     |
 | LLaMA-7B          | pretrained | 35.1     | 30.5     | 38.3           | 34.0       | 38.1     |
@@ -90,9 +95,11 @@ C-Eval Category Results
 | :---------------: | :--------: | :------: | :------: | :------------: | :--------: | :------: |
 | Baichuan-7B       | pretrained | 42.8     | 38.2     | 52.0           | 46.2       | 39.3     |
 | Baichuan2-7B-Base | pretrained | 54.9     | 47.9     | 67.3           | 58.4       | 52.8     |
+| Baichuan2-7B-Chat | fine-tuned | 52.2     | 44.6     | 65.0           | 55.8       | 50.9     |
 | ChatGLM2-6B       | fine-tuned | 50.1     | 46.4     | 60.4           | 50.6       | 46.9     |
 | Falcon-7B         | pretrained | 25.8     | 25.8     | 26.0           | 25.8       | 25.7     |
 | InternLM-7B       | pretrained | 52.4     | 47.0     | 64.9           | 55.6       | 47.6     |
+| InternLM-7B-Chat  | fine-tuned | 52.8     | 48.4     | 65.6           | 57.0       | 45.0     |
 | LLaMA-7B          | pretrained | 27.0     | 26.7     | 26.7           | 28.4       | 26.2     |
 | LLaMA2-7B         | pretrained | 28.9     | 26.8     | 34.5           | 30.0       | 26.4     |
 | MPT-7B            | pretrained | 27.8     | 27.4     | 29.8           | 26.9       | 27.7     |
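
For readers who want to reproduce or extend the evaluation above: the model card's tags (Transformers, PyTorch, custom_code) indicate that the checkpoint loads through Hugging Face Transformers with remote code enabled. The snippet below is a minimal sketch along those lines; the repository id `xverse/XVERSE-7B`, the prompt, and the dtype/device settings are assumptions for illustration, not text taken from the README being diffed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xverse/XVERSE-7B"  # assumed repository id, inferred from the model name

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # half precision so the 7B model fits on a single GPU
    trust_remote_code=True,      # required because the repo ships custom modeling code
    device_map="auto",
)
model.eval()

# The base ("pretrained") model is a plain text-completion model, so prompt it with a prefix.
inputs = tokenizer("The highest mountain in the world is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```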
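Footnote 1 states that only single-answer multiple-choice questions are scored. One common way such benchmarks are scored for base models is to compare the model's next-token likelihood of each option letter and take the highest one; accuracy is then the fraction of questions where that letter matches the gold answer. The sketch below only illustrates that idea under those assumptions; it is not the harness actually used to produce the numbers above, and the tokenization of option letters is model-specific. It reuses `model` and `tokenizer` from the previous snippet.

```python
import torch

def predict_choice(model, tokenizer, question: str, options: dict[str, str]) -> str:
    """Return the option letter the model assigns the highest next-token logit."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after "Answer:"
    scores = {}
    for letter in options:
        # Token id of " A", " B", ...; this lookup is an assumption, since
        # tokenizers differ in how they split leading spaces and letters.
        token_id = tokenizer(" " + letter, add_special_tokens=False).input_ids[-1]
        scores[letter] = next_token_logits[token_id].item()
    return max(scores, key=scores.get)

# Hypothetical usage; benchmark accuracy is the share of questions where the
# returned letter equals the gold answer.
pred = predict_choice(
    model, tokenizer,
    "Which planet is closest to the Sun?",
    {"A": "Venus", "B": "Mercury", "C": "Earth", "D": "Mars"},
)
print(pred)
```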