Upload 2 files
- README.md +21 -15
- README_en.md +20 -15
README.md
CHANGED
@@ -11,6 +11,7 @@ pipeline_tag: text-generation
 ---
 **Read this in other languages: [English](README_en.md), [中文](README.md).**
 
+* Update 2023.12.28: Released Qwen-7b-chat-yarn-32k. Note that, likely because of the smaller model size and the weaker base model, the 7b version is significantly weaker than Qwen-14b-chat-yarn-32k.
 * Update 2023.12.23: Released the passage_retrieval_en evaluation results on LongBench.
 * Update 2023.12.16: Released the [paper (Chinese)](https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/) and the [paper (English)](https://arxiv.org/abs/2312.11193).
 * Update 2023.12.14: Released the fine-tuned Qwen-14b-chat-yarn-32k. The fine-tuned model handles Chinese and English question answering at up to 32k context length (about 40,000 Chinese characters); compared with the earlier 32k model obtained through position interpolation, it almost completely resolves the low-recall ("lost in the middle") problem in multi-document QA.
@@ -22,23 +23,28 @@ pipeline_tag: text-generation
 
 # LongBench test results
 ### passage_retrieval_zh results on LongBench
-| Model | Score (acc)
-| --- | ---
-| **Qwen-14b-chat-yarn-32k** | **0.94**
-| gpt-3.5-turbo-16k | 0.81
-| chatglm3-32k | 0.725
-| Qwen-14b-chat | 0.525
-| Qwen-14b-chat-32k-lora | 0.34
-| LongAlpaca-7b-32k-chinese-v2 | 0.12
-| CausalLM-14b | 0.086
+| Model                        | Score (acc) |
+|------------------------------|-------------|
+| **Qwen-14b-chat-yarn-32k**   | **0.94**    |
+| gpt-3.5-turbo-16k            | 0.81        |
+| chatglm3-32k                 | 0.725       |
+| Qwen-14b-chat                | 0.525       |
+| Qwen-14b-chat-32k-lora       | 0.34        |
+| **Qwen-7b-chat-yarn-32k**    | **0.325**   |
+| Qwen-7b-chat                 | 0.26        |
+| LongAlpaca-7b-32k-chinese-v2 | 0.12        |
+| CausalLM-14b                 | 0.086       |
+
 
 ### passage_retrieval_en results on LongBench
-| Model | Score (acc)
-| --- | ---
-| **Qwen-14b-chat-yarn-32k** | **0.945**
-| chatglm3-32k | 0.815
-| gpt-3.5-turbo-16k | 0.88
-| Qwen-14b-chat | 0.24
+| Model                      | Score (acc) |
+|----------------------------|-------------|
+| **Qwen-14b-chat-yarn-32k** | **0.945**   |
+| chatglm3-32k               | 0.815       |
+| gpt-3.5-turbo-16k          | 0.88        |
+| **Qwen-7b-chat-yarn-32k**  | **0.47**    |
+| Qwen-14b-chat              | 0.24        |
+| Qwen-7b-chat               | 0.235       |
 
 After fine-tuning, Qwen-14b-chat-yarn-32k improves very significantly on multi-document question-answering (or retrieval) tasks and leads other models of the same scale by a large margin.
 
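The December 14 update above describes 32k-length (roughly 40,000 Chinese characters) Chinese/English QA. Below is a minimal usage sketch, not taken from the model card itself: it assumes the checkpoint keeps the standard Qwen chat interface exposed through `trust_remote_code` (i.e. `model.chat()`), and `MODEL_ID` is a placeholder for this model card's actual Hub repo id.

```python
# Minimal long-context QA sketch. Assumptions: the checkpoint exposes the usual
# Qwen chat interface via trust_remote_code (model.chat()); MODEL_ID is a
# placeholder, not the real Hub repo id of this model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen-14b-chat-yarn-32k"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",      # shard the 14b weights across available GPUs
    torch_dtype="auto",
    trust_remote_code=True,
).eval()

# Build a multi-document prompt of up to ~32k tokens and ask which passage
# answers the question -- the setting where "lost in the middle" shows up.
documents = ["passage 1 text ...", "passage 2 text ...", "passage 3 text ..."]
context = "\n\n".join(f"Paragraph {i + 1}: {doc}" for i, doc in enumerate(documents))
question = "Which paragraph discusses position interpolation? Answer with the paragraph number."

response, _history = model.chat(tokenizer, f"{context}\n\n{question}", history=None)
print(response)
```
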
README_en.md
CHANGED
@@ -11,6 +11,7 @@ pipeline_tag: text-generation
 ---
 **Read this in other languages: [English](README_en.md), [中文](README.md).**
 
+* Updated on December 28, 2023: Released Qwen-7b-chat-yarn-32k. Note that the 7b version may be significantly weaker than Qwen-14b-chat-yarn-32k due to its smaller size and the weaker capabilities of the base model.
 * Updated on December 23, 2023: Released the passage_retrieval_en evaluation results on LongBench.
 * Updated on December 16, 2023: Released the [paper](https://arxiv.org/abs/2312.11193).
 * Updated on December 14, 2023: We have released the Qwen-14b-chat-yarn-32k model, fine-tuned to handle Chinese and English question-answering tasks at up to 32k context length (approximately 40,000 Chinese characters). This model addresses the low-recall issue in multi-document question answering (the "lost in the middle" phenomenon) present in the previous 32k model obtained through position interpolation. <br>
@@ -21,23 +22,27 @@ pipeline_tag: text-generation
 # Evaluation results on LongBench
 ### Evaluation results for passage_retrieval_zh on LongBench
 
-| Models | Accuracy
-| --- | ---
-| **Qwen-14b-chat-yarn-32k** | **0.94**
-| gpt-3.5-turbo-16k | 0.81
-| chatglm3-32k | 0.725
-| Qwen-14b-chat | 0.525
-| Qwen-14b-chat-32k-lora | 0.34
-| LongAlpaca-7b-32k-chinese-v2 | 0.12
-| CausalLM-14b | 0.086
+| Models                       | Accuracy  |
+|------------------------------|-----------|
+| **Qwen-14b-chat-yarn-32k**   | **0.94**  |
+| gpt-3.5-turbo-16k            | 0.81      |
+| chatglm3-32k                 | 0.725     |
+| Qwen-14b-chat                | 0.525     |
+| Qwen-14b-chat-32k-lora       | 0.34      |
+| **Qwen-7b-chat-yarn-32k**    | **0.325** |
+| Qwen-7b-chat                 | 0.26      |
+| LongAlpaca-7b-32k-chinese-v2 | 0.12      |
+| CausalLM-14b                 | 0.086     |
 
 ### Evaluation results for passage_retrieval_en on LongBench
-| Models | Accuracy
-| --- | ---
-| **Qwen-14b-chat-yarn-32k** | **0.945**
-| chatglm3-32k | 0.815
-| gpt-3.5-turbo-16k | 0.88
-| Qwen-14b-chat | 0.24
+| Models                     | Accuracy  |
+|----------------------------|-----------|
+| **Qwen-14b-chat-yarn-32k** | **0.945** |
+| chatglm3-32k               | 0.815     |
+| gpt-3.5-turbo-16k          | 0.88      |
+| **Qwen-7b-chat-yarn-32k**  | **0.47**  |
+| Qwen-14b-chat              | 0.24      |
+| Qwen-7b-chat               | 0.235     |
 
 
 Qwen-14b-chat-yarn-32k has shown significant improvement in multi-document question-answering (or retrieval) tasks after fine-tuning and outperforms other models of similar scale.
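For reference, the passage_retrieval accuracies in the tables above can be approximated with a short harness. The sketch below is not the official LongBench scorer: it assumes the THUDM/LongBench dataset layout (`context`, `input`, `answers` fields), concatenates those fields into a bare prompt instead of LongBench's task-specific template, and uses a simplified number-matching rule; `generate_answer` is a stand-in for whichever model is being evaluated.

```python
# Simplified LongBench passage_retrieval harness (sketch, not the official scorer).
# Assumptions: THUDM/LongBench layout with "context", "input", "answers" fields;
# gold answers name a paragraph (e.g. "Paragraph 7" / "段落7"); generate_answer is
# any callable mapping a prompt string to the model's reply.
import re
from datasets import load_dataset


def passage_retrieval_accuracy(generate_answer, subset="passage_retrieval_en"):
    data = load_dataset("THUDM/LongBench", subset, split="test")
    correct = 0
    for sample in data:
        # ~30 shuffled paragraphs plus an abstract; the model must say which
        # paragraph the abstract was drawn from.
        prompt = f"{sample['context']}\n\n{sample['input']}"
        prediction = generate_answer(prompt)
        gold_num = re.findall(r"\d+", sample["answers"][0])[0]
        pred_nums = re.findall(r"\d+", prediction)
        correct += int(bool(pred_nums) and pred_nums[0] == gold_num)
    return correct / len(data)
```

The official LongBench repository ships its own scoring functions for these subsets, which apply a stricter matching rule and should be preferred when reproducing reported numbers.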