
yangapku committed
Commit: e0b4ff6 (parent: eec7863)

update readme

Files changed (1)
  1. README.md +71 -70
README.md CHANGED
@@ -126,13 +126,13 @@ Our tokenizer based on tiktoken is different from other tokenizers, e.g., senten
 
 The details of the model architecture of Qwen-7B-Chat are listed as follows:
 
-| Hyperparameter | Value |
-|:------|:------|
-| n_layers | 32 |
-| n_heads | 32 |
-| d_model | 4096 |
-| vocab size | 151851 |
-| sequence length | 2048 |
+| Hyperparameter  | Value  |
+| :-------------- | :----: |
+| n_layers        | 32     |
+| n_heads         | 32     |
+| d_model         | 4096   |
+| vocab size      | 151851 |
+| sequence length | 2048   |
 
 For positional encoding, the FFN activation function, and normalization, we likewise adopt the most popular current choices:
 RoPE relative positional encoding, the SwiGLU activation function, and RMSNorm (optionally install flash-attention for acceleration).
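For reference, the RMSNorm named above is small enough to sketch directly. A minimal PyTorch version of the standard formulation, illustrative only and not Qwen's actual implementation:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Root-mean-square layer norm: scale by the RMS of activations, no mean centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), then a learned per-channel gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

# Example: normalize hidden states with d_model = 4096 (from the table above).
h = torch.randn(2, 8, 4096)
print(RMSNorm(4096)(h).shape)  # torch.Size([2, 8, 4096])
```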
@@ -165,28 +165,28 @@ Note: Due to rounding errors caused by hardware and framework, differences in re
 
 We demonstrate the zero-shot accuracy of Qwen-7B-Chat on the C-Eval validation set:
 
-| Model | Avg. Acc. |
-|:--------------|:------:|
-| LLaMA2-7B-Chat | 31.9 |
-| LLaMA2-13B-Chat | 40.6 |
-| Chinese-Alpaca-2-7B | 41.3 |
-| Chinese-Alpaca-Plus-13B | 43.3 |
-| Baichuan-13B-Chat | 50.4 |
-| ChatGLM2-6B-Chat | 50.7 |
-| InternLM-7B-Chat | 53.2 |
-| **Qwen-7B-Chat** | **54.2** |
+| Model                   | Avg. Acc. |
+| :---------------------- | :-------: |
+| LLaMA2-7B-Chat          | 31.9      |
+| LLaMA2-13B-Chat         | 40.6      |
+| Chinese-Alpaca-2-7B     | 41.3      |
+| Chinese-Alpaca-Plus-13B | 43.3      |
+| Baichuan-13B-Chat       | 50.4      |
+| ChatGLM2-6B-Chat        | 50.7      |
+| InternLM-7B-Chat        | 53.2      |
+| **Qwen-7B-Chat**        | **54.2**  |
 
 The zero-shot accuracy of Qwen-7B-Chat on the C-Eval test set is provided below:
 
-| Model | Avg. | STEM | Social Sciences | Humanities | Others |
-|:--------------|:------:|:------:|:------:|:------:|:------:|
-| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
-| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
-| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
-| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
-| **Qwen-7B-Chat** | **54.6** | 47.8 | 67.6 | 59.3 | 50.6 |
+| Model                   | Avg.     | STEM | Social Sciences | Humanities | Others |
+| :---------------------- | :------: | :--: | :-------------: | :--------: | :----: |
+| Chinese-Alpaca-Plus-13B | 41.5     | 36.6 | 49.7            | 43.1       | 41.2   |
+| Chinese-Alpaca-2-7B     | 40.3     | -    | -               | -          | -      |
+| ChatGLM2-6B-Chat        | 50.1     | 46.4 | 60.4            | 50.6       | 46.9   |
+| Baichuan-13B-Chat       | 51.5     | 43.7 | 64.6            | 56.2       | 49.2   |
+| **Qwen-7B-Chat**        | **54.6** | 47.8 | 67.6            | 59.3       | 50.6   |
 
 At the 7B scale, the human-instruction-aligned Qwen-7B-Chat still ranks among the most accurate models of comparable size.
 
@@ -201,14 +201,14 @@ Compared with other pretrained models with comparable model size, the human-alig
 
 The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below.
 Qwen-7B-Chat still performs at the top among human-aligned models of comparable size.
 
-| Model | Avg. Acc. |
-|:--------------|:------:|
-| ChatGLM2-6B-Chat | 45.5 |
-| LLaMA2-7B-Chat | 47.0 |
-| InternLM-7B-Chat | 50.8 |
-| Baichuan-13B-Chat | 52.1 |
-| ChatGLM2-12B-Chat | 52.1 |
-| **Qwen-7B-Chat** | **53.9** |
+| Model             | Avg. Acc. |
+| :---------------- | :-------: |
+| ChatGLM2-6B-Chat  | 45.5      |
+| LLaMA2-7B-Chat    | 47.0      |
+| InternLM-7B-Chat  | 50.8      |
+| Baichuan-13B-Chat | 52.1      |
+| ChatGLM2-12B-Chat | 52.1      |
+| **Qwen-7B-Chat**  | **53.9**  |
 
 ### Coding Evaluation
 
@@ -216,13 +216,13 @@ Qwen-7B-Chat在[HumanEval](https://github.com/openai/human-eval)的zero-shot Pas
 
 The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is shown below:
 
-| Model | Pass@1 |
-|:--------------|:------:|
-| LLaMA2-7B-Chat | 12.2 |
-| InternLM-7B-Chat | 14.0 |
-| Baichuan-13B-Chat | 16.5 |
-| LLaMA2-13B-Chat | 18.9 |
-| **Qwen-7B-Chat** | **24.4** |
+| Model             | Pass@1   |
+| :---------------- | :------: |
+| LLaMA2-7B-Chat    | 12.2     |
+| InternLM-7B-Chat  | 14.0     |
+| Baichuan-13B-Chat | 16.5     |
+| LLaMA2-13B-Chat   | 18.9     |
+| **Qwen-7B-Chat**  | **24.4** |
 
 ### Mathematics Evaluation
 
@@ -230,15 +230,15 @@ The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/hu
 
 The accuracy of Qwen-7B-Chat on GSM8K is shown below:
 
-| Model | Zero-shot Acc. | 4-shot Acc. |
-|:--------------|:------:|:------:|
-| ChatGLM2-6B-Chat | - | 28.0 |
-| LLaMA2-7B-Chat | 20.4 | 28.2 |
-| LLaMA2-13B-Chat | 29.4 | 36.7 |
-| InternLM-7B-Chat | 32.6 | 34.5 |
-| Baichuan-13B-Chat | - | 36.3 |
-| ChatGLM2-12B-Chat | - | 38.1 |
-| **Qwen-7B-Chat** | **41.1** | **43.5** |
+| Model             | Zero-shot Acc. | 4-shot Acc. |
+| :---------------- | :------------: | :---------: |
+| ChatGLM2-6B-Chat  | -              | 28.0        |
+| LLaMA2-7B-Chat    | 20.4           | 28.2        |
+| LLaMA2-13B-Chat   | 29.4           | 36.7        |
+| InternLM-7B-Chat  | 32.6           | 34.5        |
+| Baichuan-13B-Chat | -              | 36.3        |
+| ChatGLM2-12B-Chat | -              | 38.1        |
+| **Qwen-7B-Chat**  | **41.1**       | **43.5**    |
 
 ### Long-Context Understanding
 
@@ -250,13 +250,13 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 
 **(To use these tricks, please set `use_dynamic_ntk` and `use_long_attn` to true in config.json.)**
 
-| Model | VCSUM (zh) |
-|:----------------|:-------:|
-| GPT-3.5-Turbo-16k | 16.0 |
-| LLama2-7B-Chat | 0.2 |
-| InternLM-7B-Chat | 13.0 |
-| ChatGLM2-6B-Chat | 16.3 |
-| **Qwen-7B-Chat** | **16.6** |
+| Model             | VCSUM (zh) |
+| :---------------- | :--------: |
+| GPT-3.5-Turbo-16k | 16.0       |
+| LLaMA2-7B-Chat    | 0.2        |
+| InternLM-7B-Chat  | 13.0       |
+| ChatGLM2-6B-Chat  | 16.3       |
+| **Qwen-7B-Chat**  | **16.6**   |
 
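A minimal sketch of flipping those two flags at load time. The flag names come from the README; the model id and the assumption that they are ordinary config fields are ours:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Override the two config.json flags named above (assumed to be plain
# config fields on the checkpoint; model id taken from this repo).
config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
config.use_dynamic_ntk = True  # NTK-aware interpolation
config.use_long_attn = True    # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", config=config, trust_remote_code=True
)
```

Alternatively, edit config.json directly and set both fields to true.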
 ### Tool Usage
 
@@ -266,11 +266,11 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 
 Qwen-7B-Chat supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629). ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework. In our evaluation benchmark for assessing tool usage capabilities, Qwen-7B-Chat's performance is as follows:
 
-| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error |
-|:-----------------|:----------------------:|:---------------------:|:---------------------:|
-| GPT-4 | 95% | **0.90** | 15% |
-| GPT-3.5 | 85% | 0.88 | 75% |
-| **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** |
+| Model            | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+| :--------------- | :--------------------: | :-------------------: | :-------------------: |
+| GPT-4            | 95%                    | **0.90**              | 15%                   |
+| GPT-3.5          | 85%                    | 0.88                  | 75%                   |
+| **Qwen-7B-Chat** | **99%**                | 0.89                  | **9.7%**              |
 
 > None of the plugins in this benchmark appear in Qwen's training data. The benchmark evaluates the model's accuracy in selecting the correct plugin from multiple candidates, the soundness of the parameters passed to the plugin, and the false-positive rate. False positive: the model incorrectly invokes a plugin while handling a request that should not call one.
 
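For concreteness, a ReAct-style prompt follows the Thought/Action/Observation loop from the paper linked above. The skeleton below is illustrative only; the `search` tool and wording are placeholders, not the benchmark's actual prompts:

```python
# Illustrative ReAct prompt skeleton; not the exact prompts used in the benchmark.
REACT_TEMPLATE = """Answer the question using the tools below.

search: search(query) returns web search results.

Use the following format:
Question: the input question
Thought: reason about what to do next
Action: the tool to use, one of [search]
Action Input: the argument passed to the tool
Observation: the tool's output
... (Thought/Action/Action Input/Observation may repeat)
Final Answer: the final answer

Question: {question}"""

print(REACT_TEMPLATE.format(question="What is the capital of France?"))
```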
@@ -289,12 +289,12 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct
 
 Qwen-7B-Chat also has the capability to be used as a [HuggingFace Agent](https://huggingface.co/docs/transformers/transformers_agents). Its performance on the run-mode benchmark provided by HuggingFace is as follows:
 
-| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
-|:-|:-:|:-:|:-:|
-|GPT-4 | **100** | **100** | **97.41** |
-|GPT-3.5 | 95.37 | 96.30 | 87.04 |
-|StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
-| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
+| Model           | Tool Selection↑ | Tool Used↑ | Code↑     |
+| :-------------- | :-------------: | :--------: | :-------: |
+| GPT-4           | **100**         | **100**    | **97.41** |
+| GPT-3.5         | 95.37           | 96.30      | 87.04     |
+| StarCoder-15.5B | 87.04           | 87.96      | 68.89     |
+| **Qwen-7B**     | 90.74           | 92.59      | 74.07     |
 
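A hedged sketch of driving a chat model as a HuggingFace Agent through the transformers agents API; the endpoint URL below is a placeholder assumption, not an official Qwen endpoint:

```python
from transformers import HfAgent

# Placeholder endpoint (assumption): a text-generation inference server
# hosting the chat model you want to drive as an agent.
agent = HfAgent(url_endpoint="https://your-endpoint.example.com/generate")

# In run mode the agent selects a tool, fills in its arguments, and writes code.
agent.run("Translate the following text into Chinese: 'Hello, world.'")
```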
 ## Quantization
 
@@ -341,10 +341,10 @@ model = AutoModelForCausalLM.from_pretrained(
 With this method you can load Qwen-7B-Chat in `NF4` and `Int8`, which reduces memory usage. We provide the related model-performance statistics below. We find that quantization degrades effectiveness slightly but significantly improves inference efficiency and reduces memory cost.
 
 | Precision | MMLU | Memory |
-| :---------| :-------: | :-----: |
-| BF16 | 56.7 | 16.2G |
-| Int8 | 52.8 | 10.1G |
-| NF4 | 48.9 | 7.4G |
+| :-------- | :--: | :----: |
+| BF16      | 56.7 | 16.2G  |
+| Int8      | 52.8 | 10.1G  |
+| NF4       | 48.9 | 7.4G   |
 
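For the Int8/NF4 rows above, a loading sketch with bitsandbytes quantization. This is illustrative, not necessarily the README's exact snippet, and the model id is assumed from this repo:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 row of the table; use BitsAndBytesConfig(load_in_8bit=True) for Int8.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```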
 ## License Agreement
 
@@ -357,3 +357,4 @@ Our code and checkpoints are open to research purpose, and they are allowed for
 If you would like to leave a message for our research team or product team, feel free to send an email to [email protected].
+