
yangapku committed
Commit: e0b4ff6 (parent: eec7863)

update readme

Files changed (1)
  1. README.md +71 -70
README.md CHANGED
@@ -126,13 +126,13 @@ Our tokenizer based on tiktoken is different from other tokenizers, e.g., senten
 
 The details of the model architecture of Qwen-7B-Chat are listed as follows:
 
-| Hyperparameter | Value |
-|:------|:------|
-| n_layers | 32 |
-| n_heads | 32 |
-| d_model | 4096 |
-| vocab size | 151851 |
-| sequence length | 2048 |
+| Hyperparameter  | Value  |
+| :-------------- | :----: |
+| n_layers        | 32     |
+| n_heads         | 32     |
+| d_model         | 4096   |
+| vocab size      | 151851 |
+| sequence length | 2048   |
 
 For positional encoding, the FFN activation function, and normalization, we likewise adopt the most popular current choices:
 RoPE relative positional encoding, the SwiGLU activation function, and RMSNorm (optionally install flash-attention for acceleration).
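For reference, the RMSNorm named above is small enough to sketch directly. A minimal PyTorch version of the standard formulation, illustrative only and not Qwen's actual implementation:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Root-mean-square layer norm: scale by the RMS of activations, no mean centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), then a learned per-channel gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

# Example: normalize hidden states with d_model = 4096 (from the table above).
h = torch.randn(2, 8, 4096)
print(RMSNorm(4096)(h).shape)  # torch.Size([2, 8, 4096])
```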
@@ -165,28 +165,28 @@ Note: Due to rounding errors caused by hardware and framework, differences in re
 
 We demonstrate the zero-shot accuracy of Qwen-7B-Chat on the C-Eval validation set:
 
-| Model | Avg. Acc. |
-|:--------------|:------:|
-| LLaMA2-7B-Chat | 31.9 |
-| LLaMA2-13B-Chat | 40.6 |
-| Chinese-Alpaca-2-7B | 41.3 |
-| Chinese-Alpaca-Plus-13B | 43.3 |
-| Baichuan-13B-Chat | 50.4 |
-| ChatGLM2-6B-Chat | 50.7 |
-| InternLM-7B-Chat | 53.2 |
-| **Qwen-7B-Chat** | **54.2** |
+| Model                   | Avg. Acc. |
+| :---------------------- | :-------: |
+| LLaMA2-7B-Chat          | 31.9      |
+| LLaMA2-13B-Chat         | 40.6      |
+| Chinese-Alpaca-2-7B     | 41.3      |
+| Chinese-Alpaca-Plus-13B | 43.3      |
+| Baichuan-13B-Chat       | 50.4      |
+| ChatGLM2-6B-Chat        | 50.7      |
+| InternLM-7B-Chat        | 53.2      |
+| **Qwen-7B-Chat**        | **54.2**  |
 
 The zero-shot accuracy of Qwen-7B-Chat on the C-Eval test set is provided below:
 
-| Model | Avg. | STEM | Social Sciences | Humanities | Others |
-|:--------------|:------:|:------:|:------:|:------:|:------:|
-| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
-| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
-| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
-| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
-| **Qwen-7B-Chat** | **54.6** | 47.8 | 67.6 | 59.3 | 50.6 |
+| Model                   | Avg.     | STEM | Social Sciences | Humanities | Others |
+| :---------------------- | :------: | :--: | :-------------: | :--------: | :----: |
+| Chinese-Alpaca-Plus-13B | 41.5     | 36.6 | 49.7            | 43.1       | 41.2   |
+| Chinese-Alpaca-2-7B     | 40.3     | -    | -               | -          | -      |
+| ChatGLM2-6B-Chat        | 50.1     | 46.4 | 60.4            | 50.6       | 46.9   |
+| Baichuan-13B-Chat       | 51.5     | 43.7 | 64.6            | 56.2       | 49.2   |
+| **Qwen-7B-Chat**        | **54.6** | 47.8 | 67.6            | 59.3       | 50.6   |
 
 At the 7B scale, the human-instruction-aligned Qwen-7B-Chat still ranks among the most accurate models of comparable size.
 
@@ -201,14 +201,14 @@ Compared with other pretrained models with comparable model size, the human-alig
 
 The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below.
 Qwen-7B-Chat still performs at the top among human-aligned models of comparable size.
 
-| Model | Avg. Acc. |
-|:--------------|:------:|
-| ChatGLM2-6B-Chat | 45.5 |
-| LLaMA2-7B-Chat | 47.0 |
-| InternLM-7B-Chat | 50.8 |
-| Baichuan-13B-Chat | 52.1 |
-| ChatGLM2-12B-Chat | 52.1 |
-| **Qwen-7B-Chat** | **53.9** |
+| Model             | Avg. Acc. |
+| :---------------- | :-------: |
+| ChatGLM2-6B-Chat  | 45.5      |
+| LLaMA2-7B-Chat    | 47.0      |
+| InternLM-7B-Chat  | 50.8      |
+| Baichuan-13B-Chat | 52.1      |
+| ChatGLM2-12B-Chat | 52.1      |
+| **Qwen-7B-Chat**  | **53.9**  |
 
 ### Coding Evaluation
 
@@ -216,13 +216,13 @@ Qwen-7B-Chat在[HumanEval](https://github.com/openai/human-eval)的zero-shot Pas
 
 The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is shown below:
 
-| Model | Pass@1 |
-|:--------------|:------:|
-| LLaMA2-7B-Chat | 12.2 |
-| InternLM-7B-Chat | 14.0 |
-| Baichuan-13B-Chat | 16.5 |
-| LLaMA2-13B-Chat | 18.9 |
-| **Qwen-7B-Chat** | **24.4** |
+| Model             | Pass@1   |
+| :---------------- | :------: |
+| LLaMA2-7B-Chat    | 12.2     |
+| InternLM-7B-Chat  | 14.0     |
+| Baichuan-13B-Chat | 16.5     |
+| LLaMA2-13B-Chat   | 18.9     |
+| **Qwen-7B-Chat**  | **24.4** |
 
 ### Mathematics Evaluation
 
@@ -230,15 +230,15 @@ The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/hu
 
 The accuracy of Qwen-7B-Chat on GSM8K is shown below:
 
-| Model | Zero-shot Acc. | 4-shot Acc. |
-|:--------------|:------:|:------:|
-| ChatGLM2-6B-Chat | - | 28.0 |
-| LLaMA2-7B-Chat | 20.4 | 28.2 |
-| LLaMA2-13B-Chat | 29.4 | 36.7 |
-| InternLM-7B-Chat | 32.6 | 34.5 |
-| Baichuan-13B-Chat | - | 36.3 |
-| ChatGLM2-12B-Chat | - | 38.1 |
-| **Qwen-7B-Chat** | **41.1** | **43.5** |
+| Model             | Zero-shot Acc. | 4-shot Acc. |
+| :---------------- | :------------: | :---------: |
+| ChatGLM2-6B-Chat  | -              | 28.0        |
+| LLaMA2-7B-Chat    | 20.4           | 28.2        |
+| LLaMA2-13B-Chat   | 29.4           | 36.7        |
+| InternLM-7B-Chat  | 32.6           | 34.5        |
+| Baichuan-13B-Chat | -              | 36.3        |
+| ChatGLM2-12B-Chat | -              | 38.1        |
+| **Qwen-7B-Chat**  | **41.1**       | **43.5**    |
 
 ### Long-Context Understanding
 
@@ -250,13 +250,13 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 
 **(To use these tricks, please set `use_dynamic_ntk` and `use_long_attn` to true in config.json.)**
 
-| Model | VCSUM (zh) |
-|:----------------|:-------:|
-| GPT-3.5-Turbo-16k | 16.0 |
-| LLama2-7B-Chat | 0.2 |
-| InternLM-7B-Chat | 13.0 |
-| ChatGLM2-6B-Chat | 16.3 |
-| **Qwen-7B-Chat** | **16.6** |
+| Model             | VCSUM (zh) |
+| :---------------- | :--------: |
+| GPT-3.5-Turbo-16k | 16.0       |
+| LLaMA2-7B-Chat    | 0.2        |
+| InternLM-7B-Chat  | 13.0       |
+| ChatGLM2-6B-Chat  | 16.3       |
+| **Qwen-7B-Chat**  | **16.6**   |
 
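A minimal sketch of flipping those two flags at load time. The flag names come from the README; the model id and the assumption that they are ordinary config fields are ours:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Override the two config.json flags named above (assumed to be plain
# config fields on the checkpoint; model id taken from this repo).
config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
config.use_dynamic_ntk = True  # NTK-aware interpolation
config.use_long_attn = True    # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", config=config, trust_remote_code=True
)
```

Alternatively, edit config.json directly and set both fields to true.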
 ### Tool Usage
 
@@ -266,11 +266,11 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 
 Qwen-7B-Chat supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629). ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework. In our evaluation benchmark for assessing tool usage capabilities, Qwen-7B-Chat's performance is as follows:
 
-| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error |
-|:-----------------|:----------------------:|:---------------------:|:---------------------:|
-| GPT-4 | 95% | **0.90** | 15% |
-| GPT-3.5 | 85% | 0.88 | 75% |
-| **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** |
+| Model            | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+| :--------------- | :--------------------: | :-------------------: | :-------------------: |
+| GPT-4            | 95%                    | **0.90**              | 15%                   |
+| GPT-3.5          | 85%                    | 0.88                  | 75%                   |
+| **Qwen-7B-Chat** | **99%**                | 0.89                  | **9.7%**              |
 
 > None of the plugins in this benchmark appear in Qwen's training data. The benchmark evaluates the model's accuracy in selecting the correct plugin from multiple candidates, the soundness of the parameters passed to the plugin, and the false-positive rate. False positive: the model incorrectly invokes a plugin while handling a request that should not call one.
 
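For concreteness, a ReAct-style prompt follows the Thought/Action/Observation loop from the paper linked above. The skeleton below is illustrative only; the `search` tool and wording are placeholders, not the benchmark's actual prompts:

```python
# Illustrative ReAct prompt skeleton; not the exact prompts used in the benchmark.
REACT_TEMPLATE = """Answer the question using the tools below.

search: search(query) returns web search results.

Use the following format:
Question: the input question
Thought: reason about what to do next
Action: the tool to use, one of [search]
Action Input: the argument passed to the tool
Observation: the tool's output
... (Thought/Action/Action Input/Observation may repeat)
Final Answer: the final answer

Question: {question}"""

print(REACT_TEMPLATE.format(question="What is the capital of France?"))
```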
@@ -289,12 +289,12 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct
 
 Qwen-7B-Chat also has the capability to be used as a [HuggingFace Agent](https://huggingface.co/docs/transformers/transformers_agents). Its performance on the run-mode benchmark provided by HuggingFace is as follows:
 
-| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
-|:-|:-:|:-:|:-:|
-|GPT-4 | **100** | **100** | **97.41** |
-|GPT-3.5 | 95.37 | 96.30 | 87.04 |
-|StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
-| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
+| Model           | Tool Selection↑ | Tool Used↑ | Code↑     |
+| :-------------- | :-------------: | :--------: | :-------: |
+| GPT-4           | **100**         | **100**    | **97.41** |
+| GPT-3.5         | 95.37           | 96.30      | 87.04     |
+| StarCoder-15.5B | 87.04           | 87.96      | 68.89     |
+| **Qwen-7B**     | 90.74           | 92.59      | 74.07     |
 
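A hedged sketch of driving a chat model as a HuggingFace Agent through the transformers agents API; the endpoint URL below is a placeholder assumption, not an official Qwen endpoint:

```python
from transformers import HfAgent

# Placeholder endpoint (assumption): a text-generation inference server
# hosting the chat model you want to drive as an agent.
agent = HfAgent(url_endpoint="https://your-endpoint.example.com/generate")

# In run mode the agent selects a tool, fills in its arguments, and writes code.
agent.run("Translate the following text into Chinese: 'Hello, world.'")
```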
 ## Quantization
 
@@ -341,10 +341,10 @@ model = AutoModelForCausalLM.from_pretrained(
 With this method you can load Qwen-7B-Chat in `NF4` and `Int8`, which reduces memory usage. We provide the related model-performance statistics below. We find that quantization degrades effectiveness slightly but significantly improves inference efficiency and reduces memory cost.
 
 | Precision | MMLU | Memory |
-| :---------| :-------: | :-----: |
-| BF16 | 56.7 | 16.2G |
-| Int8 | 52.8 | 10.1G |
-| NF4 | 48.9 | 7.4G |
+| :-------- | :--: | :----: |
+| BF16      | 56.7 | 16.2G  |
+| Int8      | 52.8 | 10.1G  |
+| NF4       | 48.9 | 7.4G   |
 
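For the Int8/NF4 rows above, a loading sketch with bitsandbytes quantization. This is illustrative, not necessarily the README's exact snippet, and the model id is assumed from this repo:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 row of the table; use BitsAndBytesConfig(load_in_8bit=True) for Int8.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```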
 ## License Agreement
 
@@ -357,3 +357,4 @@ Our code and checkpoints are open to research purpose, and they are allowed for
 If you would like to leave a message for our research team or product team, feel free to send an email to [email protected].
+