马 仕镕 committed on
Commit
96910ac
1 Parent(s): 807017b

Update README

Files changed (2)
  1. README.md +19 -9
  2. figures/arena3.png +0 -0
README.md CHANGED
@@ -66,28 +66,38 @@ DeepSeek-V2-Chat-0628 is an improved version of DeepSeek-V2-Chat. For model deta
 
 DeepSeek-V2-Chat-0628 has achieved remarkable performance on the LMSYS Chatbot Arena Leaderboard:
 
-- Overall Ranking: #11, outperforming all other open-source models.
-- Coding Arena Ranking: #3, showcasing exceptional capabilities in coding tasks.
-- Hard Prompts Arena Ranking: #3, demonstrating strong performance on challenging prompts.
+Overall Ranking: #11, outperforming all other open-source models.
 
 <p align="center">
 <img width="90%" src="figures/arena1.png" />
 </p>
 
+Coding Arena Ranking: #3, showcasing exceptional capabilities in coding tasks.
+
 <p align="center">
 <img width="90%" src="figures/arena2.png" />
 </p>
 
+Hard Prompts Arena Ranking: #3, demonstrating strong performance on challenging prompts.
+
+<p align="center">
+<img width="90%" src="figures/arena3.png" />
+</p>
+
 ## 2. Improvement
 
 Compared to the previous version DeepSeek-V2-Chat, the new version has made the following improvements:
 
-- Code: HumanEval Pass@1 increased from 79.88% to 84.76%.
-- Mathematics: MATH ACC@1 improved from 55.02% to 71.02%.
-- Reasoning: Big-Bench-Hard(BBH) improved from 78.56% to 83.40%.
-- Instruction Following: IFEval Benchmark Prompt-Level accuracy improved from 63.9% to 77.6%.
-- JSON Format Output: Internal test set performance increased from 78% to 85%.
-- Additionally, in the Arena-Hard evaluation, the win rate against GPT-4-0314 has increased from 41.6% to 68.3%. Furthermore, the instruction following capability in the "system" area has been optimized, significantly enhancing the user experience for immersive translation, RAG, and other tasks.
+| **Benchmark** | **DeepSeek-V2-Chat** | **DeepSeek-V2-Chat-0628** | **Improvement** |
+|:-----------:|:------------:|:---------------:|:-------------------------:|
+| **HumanEval** | 81.1 | 84.8 | +3.7 |
+| **MATH** | 53.9 | 71.0 | +17.1 |
+| **BBH** | 79.7 | 83.4 | +3.7 |
+| **IFEval** | 63.8 | 77.6 | +13.8 |
+| **Arena-Hard** | 41.6 | 68.3 | +26.7 |
+| **JSON Output (Internal)** | 78 | 85 | +7 |
+
+Furthermore, the instruction following capability in the "system" area has been optimized, significantly enhancing the user experience for immersive translation, RAG, and other tasks.
 
 ## 3. How to run locally
 
figures/arena3.png ADDED
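
Reviewer note: the **Improvement** column in the new benchmark table can be checked against the two score columns with a few lines of Python. This is only a minimal sketch for sanity-checking the diff. The names `REPORTED`, `improvement`, and `check_table` are hypothetical helpers introduced here and are not part of the repository; the numbers are copied from the added table rows.

```python
# Hypothetical helper for reviewing the table added in this commit.
# Each entry: (DeepSeek-V2-Chat score, DeepSeek-V2-Chat-0628 score, reported improvement),
# copied from the "+" rows of the README diff.
REPORTED = {
    "HumanEval": (81.1, 84.8, 3.7),
    "MATH": (53.9, 71.0, 17.1),
    "BBH": (79.7, 83.4, 3.7),
    "IFEval": (63.8, 77.6, 13.8),
    "Arena-Hard": (41.6, 68.3, 26.7),
    "JSON Output (Internal)": (78, 85, 7),
}


def improvement(old, new):
    """Absolute gain in points, rounded to one decimal to absorb float noise."""
    return round(new - old, 1)


def check_table(rows):
    """Map each benchmark name to True if the reported delta matches new - old."""
    return {
        name: improvement(old, new) == reported
        for name, (old, new, reported) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_table(REPORTED).items():
        print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

The check only covers the table's internal consistency (new score minus previous score); it says nothing about how the underlying benchmark scores were measured.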