Update README.md
We do not advise using base language models for text generation. Instead, apply post-training, e.g., SFT, RLHF, or continued pretraining, to this model.

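For orientation, here is a minimal, non-official sketch of loading this base checkpoint with Hugging Face Transformers as the starting point for such post-training. It assumes a transformers version that already supports the `qwen2` architecture (older releases fail with `KeyError: 'qwen2'`), and the repository id `Qwen/Qwen2-72B` is used purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id for illustration; substitute the checkpoint you use.
# Requires a transformers version that knows the "qwen2" architecture
# (older versions fail with KeyError: 'qwen2').
model_name = "Qwen/Qwen2-72B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # keep the dtype stored in the checkpoint
    device_map="auto",    # shard the weights across available devices
)

# This is a base (pretrained-only) checkpoint: treat it as the starting point
# for SFT, RLHF, or continued pretraining rather than as a chat model.
```

Any of the post-training recipes mentioned above would start from the `model` and `tokenizer` loaded this way.
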
## Performance
The evaluation of base models focuses mainly on performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.

The datasets for evaluation include the following (a sketch of running one of these few-shot settings follows the list):

**English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)

**Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)

**Math Tasks**: GSM8K (4-shot), MATH (4-shot)

**Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)

**Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)

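The shot counts above refer to standard few-shot prompting. As an illustrative, non-authoritative sketch, one way to reproduce such a setting is EleutherAI's lm-evaluation-harness; the snippet below shows what a 5-shot MMLU run might look like with it. The harness, repository id, and settings are assumptions for illustration, not the pipeline behind the numbers reported below.

```python
# Sketch only: assumes EleutherAI's lm-evaluation-harness (pip install lm_eval)
# and hardware able to host the checkpoint; the Qwen team's actual evaluation
# pipeline and prompts are not specified in this README.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                              # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2-72B,dtype=bfloat16",   # hypothetical repo id
    tasks=["mmlu"],                                          # any supported task name
    num_fewshot=5,                                           # matches the 5-shot setting above
    batch_size=8,
)
print(results["results"])                                    # per-task metric dict
```

Zero-shot entries such as TruthfulQA would simply use `num_fewshot=0`.
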
#### Qwen2-72B performance
| Datasets | DeepSeek-V2 | Mixtral-8x22B | Llama-3-70B | Qwen1.5-72B | Qwen1.5-110B | **Qwen2-72B** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Architecture | MoE | MoE | Dense | Dense | Dense | Dense |
| #Activated Params | 21B | 39B | 70B | 72B | 110B | 72B |
| #Params | 236B | 140B | 70B | 72B | 110B | 72B |
| ***English*** | | | | | | |
| MMLU | 78.5 | 77.8 | 79.5 | 77.5 | 80.4 | **84.2** |
| MMLU-Pro | - | 49.5 | 52.8 | 45.8 | 49.4 | **55.6** |
| GPQA | - | 34.3 | 36.3 | 36.3 | 35.9 | **37.9** |
| Theorem QA | - | 35.9 | 32.3 | 29.3 | 34.9 | **43.1** |
| BBH | 78.9 | 78.9 | 81.0 | 65.5 | 74.8 | **82.4** |
| HellaSwag | 87.8 | **88.7** | 88.0 | 86.0 | 87.5 | 87.6 |
| Winogrande | 84.8 | 85.0 | **85.3** | 83.0 | 83.5 | 85.1 |
| ARC-C | 70.0 | **70.7** | 68.8 | 65.9 | 69.6 | 68.9 |
| TruthfulQA | 42.2 | 51.0 | 45.6 | **59.6** | 49.6 | 54.8 |
| ***Coding*** | | | | | | |
| HumanEval | 45.7 | 46.3 | 48.2 | 46.3 | 54.3 | **64.6** |
| MBPP | 73.9 | 71.7 | 70.4 | 66.9 | 70.9 | **76.9** |
| EvalPlus | 55.0 | 54.1 | 54.8 | 52.9 | 57.7 | **65.4** |
| MultiPL-E | 44.4 | 46.7 | 46.3 | 41.8 | 52.7 | **59.6** |
| ***Mathematics*** | | | | | | |
| GSM8K | 79.2 | 83.7 | 83.0 | 79.5 | 85.4 | **89.5** |
| MATH | 43.6 | 41.7 | 42.5 | 34.1 | 49.6 | **51.1** |
| ***Chinese*** | | | | | | |
| C-Eval | 81.7 | 54.6 | 65.2 | 84.1 | 89.1 | **91.0** |
| CMMLU | 84.0 | 53.4 | 67.2 | 83.5 | 88.3 | **90.1** |
| ***Multilingual*** | | | | | | |
| Multi-Exam | 67.5 | 63.5 | 70.0 | 66.4 | 75.6 | **76.6** |
| Multi-Understanding | 77.0 | 77.7 | 79.9 | 78.2 | 78.2 | **80.7** |
| Multi-Mathematics | 58.8 | 62.9 | 67.1 | 61.7 | 64.4 | **76.0** |
| Multi-Translation | 36.0 | 23.3 | **38.0** | 35.6 | 36.2 | 37.8 |

## Citation

If you find our work helpful, feel free to give us a cite.