---
language:
- zh
- en
tags:
- qwen
pipeline_tag: text-generation
inference: false
---

# Qwen-7B
Qwen-7B 🤖 | 🤗 | Qwen-7B-Chat 🤖 | 🤗 | Demo | Report
For position encoding, FFN activation, and normalization, we adopt the prevalent practices: RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (with optional installation of flash-attention for acceleration).

For tokenization, compared with current mainstream open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. It prioritizes efficient encoding of Chinese, English, and code data, and is also friendlier to multilingual use, enabling users to directly enhance capability in some languages without extending the vocabulary. It segments numbers digit by digit and calls the [tiktoken](https://github.com/openai/tiktoken) tokenizer library for efficient tokenization.

We randomly selected a corpus of 1 million documents per language to compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the base value of 1). While ensuring efficient decoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr, etc.), equipping the model with strong scalability as well as high training and inference efficiency in these languages.

For pre-training data, Qwen-7B uses part of the open-source generic corpus on the one hand, and a massive amount of accumulated web corpus and high-quality text content on the other. After deduplication and filtering, the corpus exceeds 2.2T tokens, encompassing web text, encyclopedias, books, code, mathematics, and various domains.
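As a quick illustration of the tokenizer described above, the following minimal Python sketch loads it through `transformers` (assuming the `Qwen/Qwen-7B` checkpoint on Hugging Face; the tiktoken-based tokenizer ships with the model's remote code):

```python
from transformers import AutoTokenizer

# The tokenizer is built on tiktoken and is shipped with the model's
# remote code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Mixed Chinese/English/code text; numbers are segmented digit by digit.
text = "def area(r): return 3.14159 * r ** 2  # 计算圆的面积"
ids = tokenizer.encode(text)
print(len(ids))                       # tokens needed by the >150K vocabulary
print(tokenizer.decode(ids) == text)  # byte-level BPE should round-trip losslessly
```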
## 评测效果(Evaluation)

### 中文评测(Chinese Evaluation)

#### C-Eval

[C-Eval](https://arxiv.org/abs/2305.08322)是评测预训练模型中文常识能力的常用测评框架,覆盖人文、社科、理工、其他专业四个大方向共52个学科。我们按照标准做法,以开发集样本作为few-shot来源,评价Qwen-7B预训练模型的5-shot验证集与测试集准确率。

[C-Eval](https://arxiv.org/abs/2305.08322) is a common evaluation benchmark for testing the common sense capability of pre-trained models in Chinese. It covers 52 subjects in four major directions: humanities, social sciences, STEM, and other specialties. Following standard practice, we use the development-set samples as the few-shot source and report the 5-shot accuracy of the Qwen-7B pre-trained model on the validation and test sets.

在C-Eval验证集上,Qwen-7B模型和其他模型的准确率对比如下:

The accuracy comparison of Qwen-7B and the other models on the C-Eval validation set is shown as follows:

| Model | Avg. |
|:---------------:|---------:|
| Alpaca-7B | 28.9 |
| Vicuna-7B | 31.2 |
| ChatGLM-6B | 37.1 |
| Baichuan-7B | 42.7 |
| ChatGLM2-6B | 50.9 |
| InternLM-7B | 53.4 |
| ChatGPT | 53.5 |
| Claude-v1.3 | 55.5 |
| **Qwen-7B** | **60.8** |

在C-Eval测试集上,Qwen-7B预训练模型与其他模型的效果对比如下表所示:

The performance comparison of Qwen-7B and other models on the C-Eval test set is shown in the following table:

| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
|:--------------:|------:|------:|------:|------:|------:|------:|
| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| WestlakeLM-19B | 44.6 | 34.9 | 41.6 | 51.0 | 44.3 | 44.5 |
| AndesLM-13B | 46.0 | 29.7 | 38.1 | 61.0 | 51.0 | 41.9 |
| BatGPT-15B-sirius | 47.0 | 31.9 | 42.7 | 57.5 | 48.6 | 43.6 |
| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| Baichuan-13B | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| **Qwen-7B** | **59.6** | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |

可以看到,Qwen-7B在同等规模现有模型中取得了最高的分数,甚至相比更大规模模型也具有较强竞争力。

As can be seen, Qwen-7B achieves the highest score among existing models of comparable scale, and remains highly competitive even against larger models.
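For readers unfamiliar with the 5-shot protocol above, the sketch below shows one common way to assemble a few-shot prompt from development-set samples. It is illustrative only: the `dev_examples` contents are hypothetical placeholders, not the exact harness or prompt template used to produce these scores.

```python
# Hypothetical dev-set exemplars for one C-Eval-style subject; a real
# harness draws these from the benchmark's development split.
dev_examples = [
    {"question": "...", "choices": {"A": "...", "B": "...", "C": "...", "D": "..."}, "answer": "A"},
    # ... four more exemplars to reach 5 shots
]

def format_example(ex, with_answer=True):
    # Render one multiple-choice question, optionally with its answer.
    lines = [ex["question"]]
    lines += [f"{k}. {v}" for k, v in ex["choices"].items()]
    lines.append("答案:" + (ex["answer"] if with_answer else ""))
    return "\n".join(lines)

def build_5shot_prompt(test_ex):
    # Concatenate 5 solved exemplars, then the unanswered test question;
    # the model's next token is parsed as its predicted choice.
    shots = "\n\n".join(format_example(ex) for ex in dev_examples[:5])
    return shots + "\n\n" + format_example(test_ex, with_answer=False)
```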
### 英文评测(English Evaluation)

#### MMLU

[MMLU](https://arxiv.org/abs/2009.03300)是目前评测英文综合能力最权威的基准评测之一,同样覆盖了不同学科领域、不同难度层级的57个子任务。Qwen-7B在MMLU上的5-shot准确率表现如下表:

[MMLU](https://arxiv.org/abs/2009.03300) is currently one of the most recognized benchmarks for evaluating English comprehension, covering 57 subtasks across different academic fields and difficulty levels. The 5-shot MMLU accuracy of Qwen-7B is shown in the following table:

| Model | Avg. | STEM | Social Sciences | Humanities | Others |
|:--------------:|------:|------:|------:|------:|------:|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 |
| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 |
| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 |
| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 |
| **Qwen-7B** | **56.7** | 47.6 | 65.9 | 51.5 | 64.7 |

在英文方面,Qwen-7B的效果同样超过了目前国内外其他同类开源预训练模型,同样对比更大规模版本的模型也具有较强竞争力。

On English tasks, Qwen-7B likewise surpasses other open-source pre-trained models of similar scale, and is competitive even with larger versions of other models.

### 代码评测(Coding Evaluation)

我们在[HumanEval](https://github.com/openai/human-eval)(0-shot)上对比预训练模型的代码能力,结果如下:

We compared the coding capabilities of pre-trained models on [HumanEval](https://github.com/openai/human-eval) (0-shot), and the results are as follows:

| Model | Pass@1 |
|:--------------:|------:|
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-13B | 12.8 |
| LLaMA-13B | 15.8 |
| MPT-7B | 18.3 |
| LLaMA2-13B | 18.3 |
| **Qwen-7B** | **24.4** |

### 数学评测(Mathematics Evaluation)

数学能力使用常用的[GSM8K](https://github.com/openai/grade-school-math)数据集(8-shot)评价:

We compared the math capabilities of pre-trained models on [GSM8K](https://github.com/openai/grade-school-math) (8-shot), and the results are as follows:

| Model | Acc. |
|:--------------:|------:|
| MPT-7B | 6.8 |
| Falcon-7B | 6.8 |
| Baichuan-7B | 9.7 |
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 14.6 |
| LLaMA-13B | 17.8 |
| Baichuan-13B | 26.6 |
| LLaMA2-13B | 28.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| ChatGLM2-12B | 40.9 |
| **Qwen-7B** | **51.6** |

### 翻译评测(Translation Evaluation)

我们使用[WMT22](https://www.statmt.org/wmt22/translation-task.html)中-英(zh-en)和英-中(en-zh)数据集(5-shot BLEU)评测:

We compared the translation capabilities of pre-trained models on the [WMT22](https://www.statmt.org/wmt22/translation-task.html) zh-en and en-zh sets (5-shot BLEU), and the results are as follows:

| Model | Avg. | zh-en | en-zh |
|:-----------:|---------:|---------:|---------:|
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA-13B | 15.8 | 19.5 | 12.0 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Bloom-7B | 20.3 | 19.1 | 21.4 |
| LLaMA2-13B | 23.3 | 22.4 | 24.2 |
| PolyLM-13B | 23.6 | 20.2 | 27.0 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| **Qwen-7B** | **27.5** | **24.3** | **30.6** |

### 长序列评测(Long-Context Evaluation)

我们引入NTK插值、LogN注意力缩放、窗口注意力等技巧,将模型的上下文长度扩展到8K以上。在arXiv数据上使用PPL指标测试Qwen-7B在不同长度下的表现,结果如下:**(若要启用NTK插值和LogN注意力缩放,请将config.json里的`use_dynamic_ntk`和`use_logn_attn`设置为true)**

We introduce NTK-aware interpolation, LogN attention scaling, window attention, and other techniques to extend the context length to over 8K tokens. We conduct language-modeling experiments on the arXiv dataset with the PPL metric; the results are shown below: **(To enable NTK interpolation and LogN scaling, please set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)**
| Model | 1024 | 2048 | 4096 | 8192 | 16384 |
|:-----------------------------------|-----:|-----:|------:|-------:|--------:|
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
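As an alternative to editing config.json by hand, the flags named in the note above can also be flipped at load time. A minimal sketch, assuming the `Qwen/Qwen-7B` checkpoint on Hugging Face:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# Same effect as setting these fields to true in config.json.
config.use_dynamic_ntk = True   # NTK-aware interpolation for long inputs
config.use_logn_attn = True     # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", config=config, trust_remote_code=True
).eval()
```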