.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ *venv
2
+ *.DS_Store
3
+ *.idea/
README.md CHANGED
@@ -10,20 +10,16 @@ tags:
10
  - chatglm
11
  - thudm
12
  inference: false
13
- pipeline_tag: text-generation
14
  ---
15
 
16
  # GLM-4-9B-Chat
 
17
 
18
- Read this in [English](README_en.md)
19
-
20
- GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series released by Zhipu AI. In dataset evaluations covering semantics, mathematics, reasoning, code, and knowledge,
- **GLM-4-9B** and its human-preference-aligned version **GLM-4-9B-Chat** both show excellent performance surpassing Llama-3-8B. Beyond multi-turn dialogue, GLM-4-9B-Chat
- also provides advanced features such as web browsing, code execution, custom tool calling (Function Call), and long-text reasoning (supporting up to 128K context). This generation adds multilingual support for
- 26 languages, including Japanese, Korean, and German. We have also released the **GLM-4-9B-Chat-1M** model with 1M context length (about 2 million Chinese characters) and GLM-4V-9B, a multimodal model based on GLM-4-9B.
- **GLM-4V-9B** supports bilingual (Chinese and English) multi-turn dialogue at 1120 * 1120 resolution, and in multimodal evaluations of comprehensive Chinese-English ability, perceptual reasoning, text recognition, and chart understanding,
- it shows excellent performance surpassing GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
27
 
28
  ## Evaluation Results
29
 
@@ -35,6 +31,7 @@ GLM-4V-9B。**GLM-4V-9B** 具备 1120 * 1120 高分辨率下的中英双语多
35
  | ChatGLM3-6B | 3.97 | 5.50 | 28.1 | 66.4 | 69.0 | 72.3 | 25.7 | 58.5 | 11.3 |
36
  | GLM-4-9B-Chat | 6.61 | 8.35 | 69.0 | 72.4 | 75.6 | 79.6 | 50.6 | 71.8 | 32.2 |
37
 
 
38
  ### Long Context

  The [needle-in-a-haystack test](https://github.com/LargeWorldModel/LWM/blob/main/scripts/eval_needle.py) was run at a context length of 1M, with the following results:
@@ -49,19 +46,20 @@ GLM-4V-9B。**GLM-4V-9B** 具备 1120 * 1120 高分辨率下的中英双语多
49
 
50
  GLM-4-9B-Chat and Llama-3-8B-Instruct were tested on six multilingual datasets; the results and the languages selected for each dataset are shown in the table below:
51
 
52
- | Dataset | Llama-3-8B-Instruct | GLM-4-9B-Chat | Languages |
53
  |:------------|:-------------------:|:-------------:|:----------------------------------------------------------------------------------------------:|
54
- | M-MMLU | 49.6 | 56.6 | all |
55
- | FLORES | 25.0 | 28.8 | ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no |
56
- | MGSM | 54.0 | 65.3 | zh, en, bn, de, es, fr, ja, ru, sw, te, th |
57
- | XWinograd | 61.7 | 73.1 | zh, en, fr, jp, ru, pt |
58
- | XStoryCloze | 84.7 | 90.7 | zh, en, ar, es, eu, hi, id, my, ru, sw, te |
59
- | XCOPA | 73.3 | 80.1 | zh, et, ht, id, it, qu, sw, ta, th, tr, vi |
 
 
60
 
61
  ### Tool Calling

- We tested on the [Berkeley Function Calling Leaderboard](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) and obtained the following results:
65
 
66
  | Model | Overall Acc. | AST Summary | Exec Summary | Relevance |
67
  |:-----------------------|:------------:|:-----------:|:------------:|:---------:|
@@ -74,7 +72,9 @@ GLM-4V-9B。**GLM-4V-9B** 具备 1120 * 1120 高分辨率下的中英双语多
74
 
75
  ## Running the Model

- Inference with the transformers backend:
 
 
78
 
79
  ```python
80
  import torch
@@ -82,7 +82,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
82
 
83
  device = "cuda"
84
 
85
- tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
86
 
87
  query = "你好"
88
 
@@ -147,25 +147,18 @@ print(outputs[0].outputs[0].text)
147
 
148
  Use of the GLM-4 model weights must follow the [LICENSE](LICENSE).
149
 
 
150
  ## Citation

  If you find our work helpful, please consider citing the following paper.
153
 
154
  ```
155
- @article{zeng2022glm,
156
- title={Glm-130b: An open bilingual pre-trained model},
157
- author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
158
- journal={arXiv preprint arXiv:2210.02414},
159
- year={2022}
 
 
160
  }
161
  ```
162
-
163
- ```
164
- @inproceedings{du2022glm,
165
- title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
166
- author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
167
- booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
168
- pages={320--335},
169
- year={2022}
170
- }
171
- ```
 
10
  - chatglm
11
  - thudm
12
  inference: false
 
13
  ---
14
 
15
  # GLM-4-9B-Chat
16
+ Read this in [English](README_en.md).
17
 
18
+ GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series released by Zhipu AI.
+ In dataset evaluations covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human-preference-aligned version GLM-4-9B-Chat both show strong performance.
+ Beyond multi-turn dialogue, GLM-4-9B-Chat also provides advanced features such as web browsing, code execution, custom tool calling (Function Call), and long-text reasoning (supporting up to 128K context).
+ This generation adds multilingual support for 26 languages, including Japanese, Korean, and German. We have also released a model that supports a 1M context length (about 2 million Chinese characters).
23
 
24
  ## Evaluation Results
25
 
 
31
  | ChatGLM3-6B | 3.97 | 5.50 | 28.1 | 66.4 | 69.0 | 72.3 | 25.7 | 58.5 | 11.3 |
32
  | GLM-4-9B-Chat | 6.61 | 8.35 | 69.0 | 72.4 | 75.6 | 79.6 | 50.6 | 71.8 | 32.2 |
33
 
34
+
35
  ### Long Context

  The [needle-in-a-haystack test](https://github.com/LargeWorldModel/LWM/blob/main/scripts/eval_needle.py) was run at a context length of 1M, with the following results:
 
46
 
47
  GLM-4-9B-Chat and Llama-3-8B-Instruct were tested on six multilingual datasets; the results and the languages selected for each dataset are shown in the table below:
48
 
49
+ | Dataset     | Llama-3-8B-Instruct | GLM-4-9B-Chat | Languages                                                                                        |
  |:------------|:-------------------:|:-------------:|:------------------------------------------------------------------------------------------------:|
+ | M-MMLU      | 49.6                | 56.6          | all                                                                                              |
+ | FLORES      | 25.0                | 28.8          | ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no   |
+ | MGSM        | 54.0                | 65.3          | zh, en, bn, de, es, fr, ja, ru, sw, te, th                                                       |
+ | XWinograd   | 61.7                | 73.1          | zh, en, fr, jp, ru, pt                                                                           |
+ | XStoryCloze | 84.7                | 90.7          | zh, en, ar, es, eu, hi, id, my, ru, sw, te                                                       |
+ | XCOPA       | 73.3                | 80.1          | zh, et, ht, id, it, qu, sw, ta, th, tr, vi                                                       |
57
+
58
+
59
 
60
  ### Tool Calling

+ We tested on the [Berkeley Function Calling Leaderboard](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) and obtained the following results:
 
63
 
64
  | Model | Overall Acc. | AST Summary | Exec Summary | Relevance |
65
  |:-----------------------|:------------:|:-----------:|:------------:|:---------:|
 
72
 
73
  ## Running the Model

+ For more inference code and dependency information, please visit our [github](https://github.com/THUDM/GLM-4).
+
+ ### Inference with the transformers backend:
78
 
79
  ```python
80
  import torch
 
82
 
83
  device = "cuda"
84
 
85
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
86
 
87
  query = "你好"
88
 
 
147
 
148
  Use of the GLM-4 model weights must follow the [LICENSE](LICENSE).
149
 
150
+
151
  ## Citation

  If you find our work helpful, please consider citing the following paper.
154
 
155
  ```
156
+ @misc{glm2024chatglm,
157
+ title={ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools},
158
+ author={Team GLM and Aohan Zeng and Bin Xu and Bowen Wang and Chenhui Zhang and Da Yin and Diego Rojas and Guanyu Feng and Hanlin Zhao and Hanyu Lai and Hao Yu and Hongning Wang and Jiadai Sun and Jiajie Zhang and Jiale Cheng and Jiayi Gui and Jie Tang and Jing Zhang and Juanzi Li and Lei Zhao and Lindong Wu and Lucen Zhong and Mingdao Liu and Minlie Huang and Peng Zhang and Qinkai Zheng and Rui Lu and Shuaiqi Duan and Shudan Zhang and Shulin Cao and Shuxun Yang and Weng Lam Tam and Wenyi Zhao and Xiao Liu and Xiao Xia and Xiaohan Zhang and Xiaotao Gu and Xin Lv and Xinghan Liu and Xinyi Liu and Xinyue Yang and Xixuan Song and Xunkai Zhang and Yifan An and Yifan Xu and Yilin Niu and Yuantao Yang and Yueyan Li and Yushi Bai and Yuxiao Dong and Zehan Qi and Zhaoyu Wang and Zhen Yang and Zhengxiao Du and Zhenyu Hou and Zihan Wang},
159
+ year={2024},
160
+ eprint={2406.12793},
161
+ archivePrefix={arXiv},
162
+ primaryClass={cs.CL}
163
  }
164
  ```
README_en.md CHANGED
@@ -64,9 +64,9 @@ on [Berkeley Function Calling Leaderboard](https://github.com/ShishirPatil/goril
64
 
65
  **This repository is the model repository of GLM-4-9B-Chat, supporting `128K` context length.**
66
 
67
- ## Quick call
68
 
69
- **For hardware configuration and system requirements, please check [here](basic_demo/README_en.md).**
70
 
71
  ### Use the following method to quickly call the GLM-4-9B-Chat language model
72
 
@@ -135,7 +135,6 @@ sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_i
135
 
136
  inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
137
  outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
138
-
139
  print(outputs[0].outputs[0].text)
140
  ```
141
 
@@ -148,20 +147,12 @@ The weights of the GLM-4 model are available under the terms of [LICENSE](LICENS
148
  If you find our work useful, please consider citing the following paper.
149
 
150
  ```
151
- @article{zeng2022glm,
152
- title={Glm-130b: An open bilingual pre-trained model},
153
- author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
154
- journal={arXiv preprint arXiv:2210.02414},
155
- year={2022}
156
- }
157
- ```
158
-
159
- ```
160
- @inproceedings{du2022glm,
161
- title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
162
- author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
163
- booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
164
- pages={320--335},
165
- year={2022}
166
  }
167
  ```
 
64
 
65
  **This repository is the model repository of GLM-4-9B-Chat, supporting `128K` context length.**
66
 
67
+ ## Quick Start
68
 
69
+ For more inference code and requirements, please visit our [github page](https://github.com/THUDM/GLM-4).
70
 
71
  ### Use the following method to quickly call the GLM-4-9B-Chat language model
72
 
 
135
 
136
  inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
137
  outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
 
138
  print(outputs[0].outputs[0].text)
139
  ```
140
 
 
147
  If you find our work useful, please consider citing the following paper.
148
 
149
  ```
150
+ @misc{glm2024chatglm,
151
+ title={ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools},
152
+ author={Team GLM and Aohan Zeng and Bin Xu and Bowen Wang and Chenhui Zhang and Da Yin and Diego Rojas and Guanyu Feng and Hanlin Zhao and Hanyu Lai and Hao Yu and Hongning Wang and Jiadai Sun and Jiajie Zhang and Jiale Cheng and Jiayi Gui and Jie Tang and Jing Zhang and Juanzi Li and Lei Zhao and Lindong Wu and Lucen Zhong and Mingdao Liu and Minlie Huang and Peng Zhang and Qinkai Zheng and Rui Lu and Shuaiqi Duan and Shudan Zhang and Shulin Cao and Shuxun Yang and Weng Lam Tam and Wenyi Zhao and Xiao Liu and Xiao Xia and Xiaohan Zhang and Xiaotao Gu and Xin Lv and Xinghan Liu and Xinyi Liu and Xinyue Yang and Xixuan Song and Xunkai Zhang and Yifan An and Yifan Xu and Yilin Niu and Yuantao Yang and Yueyan Li and Yushi Bai and Yuxiao Dong and Zehan Qi and Zhaoyu Wang and Zhen Yang and Zhengxiao Du and Zhenyu Hou and Zihan Wang},
153
+ year={2024},
154
+ eprint={2406.12793},
155
+ archivePrefix={arXiv},
156
+ primaryClass={cs.CL}
157
  }
158
  ```
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "THUDM/glm4-9b-chat",
3
  "model_type": "chatglm",
4
  "architectures": [
5
  "ChatGLMModel"
@@ -17,6 +17,7 @@
17
  "apply_residual_connection_post_layernorm": false,
18
  "attention_dropout": 0.0,
19
  "attention_softmax_in_fp32": true,
 
20
  "bias_dropout_fusion": true,
21
  "ffn_hidden_size": 13696,
22
  "fp32_residual_connection": false,
@@ -37,9 +38,8 @@
37
  "seq_length": 131072,
38
  "use_cache": true,
39
  "torch_dtype": "bfloat16",
40
- "transformers_version": "4.30.2",
41
  "tie_word_embeddings": false,
42
  "eos_token_id": [151329, 151336, 151338],
43
  "pad_token_id": 151329
44
- }
45
-
 
1
  {
2
+ "_name_or_path": "THUDM/glm-4-9b-chat",
3
  "model_type": "chatglm",
4
  "architectures": [
5
  "ChatGLMModel"
 
17
  "apply_residual_connection_post_layernorm": false,
18
  "attention_dropout": 0.0,
19
  "attention_softmax_in_fp32": true,
20
+ "attn_implementation": "sdpa",
21
  "bias_dropout_fusion": true,
22
  "ffn_hidden_size": 13696,
23
  "fp32_residual_connection": false,
 
38
  "seq_length": 131072,
39
  "use_cache": true,
40
  "torch_dtype": "bfloat16",
41
+ "transformers_version": "4.40.2",
42
  "tie_word_embeddings": false,
43
  "eos_token_id": [151329, 151336, 151338],
44
  "pad_token_id": 151329
45
+ }
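The new `"attn_implementation": "sdpa"` key makes scaled-dot-product attention the default backend for this checkpoint. A minimal loading sketch (not part of this commit; it assumes transformers ≥ 4.40 as pinned above and, for flash attention, an installed `flash-attn` package) showing how that default can be overridden at load time:

```python
# Sketch: overriding the attention backend declared in config.json.
# Valid values map to CORE_ATTENTION_CLASSES in modeling_chatglm.py:
# "eager", "sdpa" (the shipped default), and "flash_attention_2".
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # or "sdpa" / "eager"
).eval()
```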
 
generation_config.json CHANGED
@@ -9,5 +9,5 @@
9
  "temperature": 0.8,
10
  "max_length": 128000,
11
  "top_p": 0.8,
12
- "transformers_version": "4.38.2"
13
  }
 
9
  "temperature": 0.8,
10
  "max_length": 128000,
11
  "top_p": 0.8,
12
+ "transformers_version": "4.40.2"
13
  }
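These sampling defaults are what `model.generate()` picks up automatically for this repo. A small sketch (illustration only, not part of the commit) of reading and overriding them via `GenerationConfig`:

```python
# Sketch: the repo-level generation defaults can be inspected and overridden per call.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("THUDM/glm-4-9b-chat")
print(gen_config.temperature, gen_config.top_p, gen_config.max_length)  # 0.8 0.8 128000

# Per-call arguments take precedence over the repo defaults, e.g.:
# outputs = model.generate(**inputs, generation_config=gen_config, max_new_tokens=256)
```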
modeling_chatglm.py CHANGED
@@ -21,12 +21,17 @@ from transformers.modeling_outputs import (
21
  SequenceClassifierOutputWithPast,
22
  )
23
  from transformers.modeling_utils import PreTrainedModel
24
- from transformers.utils import logging, is_torch_npu_available
 
25
  from transformers.generation.logits_process import LogitsProcessor
26
  from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput
27
 
28
  from .configuration_chatglm import ChatGLMConfig
29
 
30
  # flags required to enable jit fusion kernels
31
 
32
  if sys.platform != 'darwin' and not is_torch_npu_available():
@@ -40,6 +45,7 @@ logger = logging.get_logger(__name__)
40
  _CHECKPOINT_FOR_DOC = "THUDM/ChatGLM"
41
  _CONFIG_FOR_DOC = "ChatGLMConfig"
42
 
 
43
  def default_init(cls, *args, **kwargs):
44
  return cls(*args, **kwargs)
45
 
@@ -159,12 +165,13 @@ class RMSNorm(torch.nn.Module):
159
  class CoreAttention(torch.nn.Module):
160
  def __init__(self, config: ChatGLMConfig, layer_number):
161
  super(CoreAttention, self).__init__()
162
-
163
  self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
164
  self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
165
  if self.apply_query_key_layer_scaling:
166
  self.attention_softmax_in_fp32 = True
167
  self.layer_number = max(1, layer_number)
 
168
 
169
  projection_size = config.kv_channels * config.num_attention_heads
170
 
@@ -183,91 +190,198 @@ class CoreAttention(torch.nn.Module):
183
  self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
184
 
185
  def forward(self, query_layer, key_layer, value_layer, attention_mask):
186
- pytorch_major_version = int(torch.__version__.split('.')[0])
187
- if pytorch_major_version >= 2:
188
- if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
189
- context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
190
- is_causal=True)
191
- else:
192
- if attention_mask is not None:
193
- attention_mask = ~attention_mask
194
- context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
195
- attention_mask)
196
- context_layer = context_layer.transpose(1, 2).contiguous()
197
- new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
198
- context_layer = context_layer.reshape(*new_context_layer_shape)
 
199
  else:
200
- # Raw attention scores
201
 
202
- # [b, np, sq, sk]
203
- output_size = (query_layer.size(0), query_layer.size(1), query_layer.size(2), key_layer.size(2))
204
 
205
- # [b, np, sq, hn] -> [b * np, sq, hn]
206
- query_layer = query_layer.view(output_size[0] * output_size[1], output_size[2], -1)
207
- # [b, np, sk, hn] -> [b * np, sk, hn]
208
- key_layer = key_layer.view(output_size[0] * output_size[1], output_size[3], -1)
 
209
 
210
- # preallocting input tensor: [b * np, sq, sk]
211
- matmul_input_buffer = torch.empty(
212
- output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
213
- device=query_layer.device
214
  )
215
 
216
- # Raw attention scores. [b * np, sq, sk]
217
- matmul_result = torch.baddbmm(
218
- matmul_input_buffer,
219
- query_layer, # [b * np, sq, hn]
220
- key_layer.transpose(1, 2), # [b * np, hn, sk]
221
- beta=0.0,
222
- alpha=(1.0 / self.norm_factor),
223
  )
224
 
225
- # change view to [b, np, sq, sk]
226
- attention_scores = matmul_result.view(*output_size)
227
-
228
- # ===========================
229
- # Attention probs and dropout
230
- # ===========================
231
-
232
- # attention scores and attention mask [b, np, sq, sk]
233
- if self.attention_softmax_in_fp32:
234
- attention_scores = attention_scores.float()
235
- if self.coeff is not None:
236
- attention_scores = attention_scores * self.coeff
237
- if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
238
- attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
239
- device=attention_scores.device, dtype=torch.bool)
240
- attention_mask.tril_()
241
- attention_mask = ~attention_mask
242
- if attention_mask is not None:
243
- attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
244
- attention_probs = F.softmax(attention_scores, dim=-1)
245
- attention_probs = attention_probs.type_as(value_layer)
246
-
247
- # This is actually dropping out entire tokens to attend to, which might
248
- # seem a bit unusual, but is taken from the original Transformer paper.
249
- attention_probs = self.attention_dropout(attention_probs)
250
-
251
- # query layer shape: [b * np, sq, hn]
252
- # value layer shape: [b, np, sk, hn]
253
- # attention shape: [b, np, sq, sk]
254
- # context layer shape: [b, np, sq, hn]
255
- output_size = (value_layer.size(0), value_layer.size(1), query_layer.size(1), value_layer.size(3))
256
- # change view [b * np, sk, hn]
257
- value_layer = value_layer.view(output_size[0] * output_size[1], value_layer.size(2), -1)
258
- # change view [b * np, sq, sk]
259
- attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
260
- # matmul: [b * np, sq, hn]
261
- context_layer = torch.bmm(attention_probs, value_layer)
262
- # change view [b, np, sq, hn]
263
- context_layer = context_layer.view(*output_size)
264
- # [b, np, sq, hn] --> [b, sq, np, hn]
265
- context_layer = context_layer.transpose(1, 2).contiguous()
266
- # [b, sq, np, hn] --> [b, sq, hp]
267
- new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
268
- context_layer = context_layer.reshape(*new_context_layer_shape)
269
 
270
- return context_layer
271
 
272
 
273
  class SelfAttention(torch.nn.Module):
@@ -299,7 +413,7 @@ class SelfAttention(torch.nn.Module):
299
  device=device, **_config_to_kwargs(config)
300
  )
301
 
302
- self.core_attention = CoreAttention(config, self.layer_number)
303
 
304
  # Output.
305
  self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
@@ -378,7 +492,8 @@ class SelfAttention(torch.nn.Module):
378
  value_layer = torch.cat((cache_v, value_layer), dim=2)
379
  if use_cache:
380
  if kv_cache is None:
381
- kv_cache = torch.cat((key_layer.unsqueeze(0).unsqueeze(0), value_layer.unsqueeze(0).unsqueeze(0)), dim=1)
 
382
  else:
383
  kv_cache = (key_layer, value_layer)
384
  else:
@@ -644,12 +759,18 @@ class ChatGLMPreTrainedModel(PreTrainedModel):
644
  config_class = ChatGLMConfig
645
  base_model_prefix = "transformer"
646
  _no_split_modules = ["GLMBlock"]
 
 
647
 
648
  def _init_weights(self, module: nn.Module):
649
  """Initialize the weights."""
650
  return
651
 
652
  def get_masks(self, input_ids, past_key_values, padding_mask=None):
653
  batch_size, seq_length = input_ids.shape
654
  full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_ids.device)
655
  full_attention_mask.tril_()
@@ -724,7 +845,8 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
724
  config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels
725
  )
726
 
727
- self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, rope_ratio=config.rope_ratio, original_impl=config.original_rope,
 
728
  device=device, dtype=config.torch_dtype)
729
  self.encoder = init_method(GLMTransformer, config, **init_kwargs)
730
  self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
@@ -745,6 +867,7 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
745
  past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
746
  inputs_embeds: Optional[torch.Tensor] = None,
747
  use_cache: Optional[bool] = None,
 
748
  output_hidden_states: Optional[bool] = None,
749
  return_dict: Optional[bool] = None,
750
  ):
@@ -1156,6 +1279,7 @@ class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1156
  inputs_embeds: Optional[torch.LongTensor] = None,
1157
  labels: Optional[torch.LongTensor] = None,
1158
  use_cache: Optional[bool] = None,
 
1159
  output_hidden_states: Optional[bool] = None,
1160
  return_dict: Optional[bool] = None,
1161
  ) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutputWithPast]:
@@ -1169,6 +1293,7 @@ class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1169
  past_key_values=past_key_values,
1170
  inputs_embeds=inputs_embeds,
1171
  use_cache=use_cache,
 
1172
  output_hidden_states=output_hidden_states,
1173
  return_dict=return_dict,
1174
  )
 
21
  SequenceClassifierOutputWithPast,
22
  )
23
  from transformers.modeling_utils import PreTrainedModel
24
+ from transformers.utils import logging, is_torch_npu_available, is_flash_attn_greater_or_equal_2_10, \
25
+ is_flash_attn_2_available
26
  from transformers.generation.logits_process import LogitsProcessor
27
  from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput
28
 
29
  from .configuration_chatglm import ChatGLMConfig
30
 
31
+ if is_flash_attn_2_available():
32
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
33
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
34
+
35
  # flags required to enable jit fusion kernels
36
 
37
  if sys.platform != 'darwin' and not is_torch_npu_available():
 
45
  _CHECKPOINT_FOR_DOC = "THUDM/ChatGLM"
46
  _CONFIG_FOR_DOC = "ChatGLMConfig"
47
 
48
+
49
  def default_init(cls, *args, **kwargs):
50
  return cls(*args, **kwargs)
51
 
 
165
  class CoreAttention(torch.nn.Module):
166
  def __init__(self, config: ChatGLMConfig, layer_number):
167
  super(CoreAttention, self).__init__()
168
+ self.config = config
169
  self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
170
  self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
171
  if self.apply_query_key_layer_scaling:
172
  self.attention_softmax_in_fp32 = True
173
  self.layer_number = max(1, layer_number)
174
+ self.is_causal = True
175
 
176
  projection_size = config.kv_channels * config.num_attention_heads
177
 
 
190
  self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
191
 
192
  def forward(self, query_layer, key_layer, value_layer, attention_mask):
193
+ # [b, np, sq, sk]
194
+ output_size = (query_layer.size(0), query_layer.size(1), query_layer.size(2), key_layer.size(2))
195
+
196
+ # [b, np, sq, hn] -> [b * np, sq, hn]
197
+ query_layer = query_layer.view(output_size[0] * output_size[1], output_size[2], -1)
198
+ # [b, np, sk, hn] -> [b * np, sk, hn]
199
+ key_layer = key_layer.view(output_size[0] * output_size[1], output_size[3], -1)
200
+
201
+ # preallocting input tensor: [b * np, sq, sk]
202
+ matmul_input_buffer = torch.empty(
203
+ output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
204
+ device=query_layer.device
205
+ )
206
+
207
+ # Raw attention scores. [b * np, sq, sk]
208
+ matmul_result = torch.baddbmm(
209
+ matmul_input_buffer,
210
+ query_layer, # [b * np, sq, hn]
211
+ key_layer.transpose(1, 2), # [b * np, hn, sk]
212
+ beta=0.0,
213
+ alpha=(1.0 / self.norm_factor),
214
+ )
215
+
216
+ # change view to [b, np, sq, sk]
217
+ attention_scores = matmul_result.view(*output_size)
218
+
219
+ # ===========================
220
+ # Attention probs and dropout
221
+ # ===========================
222
+
223
+ # attention scores and attention mask [b, np, sq, sk]
224
+ if self.attention_softmax_in_fp32:
225
+ attention_scores = attention_scores.float()
226
+ if self.coeff is not None:
227
+ attention_scores = attention_scores * self.coeff
228
+ if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
229
+ attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
230
+ device=attention_scores.device, dtype=torch.bool)
231
+ attention_mask.tril_()
232
+ attention_mask = ~attention_mask
233
+ if attention_mask is not None:
234
+ attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
235
+ attention_probs = F.softmax(attention_scores, dim=-1)
236
+ attention_probs = attention_probs.type_as(value_layer)
237
+
238
+ # This is actually dropping out entire tokens to attend to, which might
239
+ # seem a bit unusual, but is taken from the original Transformer paper.
240
+ attention_probs = self.attention_dropout(attention_probs)
241
+
242
+ # query layer shape: [b * np, sq, hn]
243
+ # value layer shape: [b, np, sk, hn]
244
+ # attention shape: [b, np, sq, sk]
245
+ # context layer shape: [b, np, sq, hn]
246
+ output_size = (value_layer.size(0), value_layer.size(1), query_layer.size(1), value_layer.size(3))
247
+ # change view [b * np, sk, hn]
248
+ value_layer = value_layer.view(output_size[0] * output_size[1], value_layer.size(2), -1)
249
+ # change view [b * np, sq, sk]
250
+ attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
251
+ # matmul: [b * np, sq, hn]
252
+ context_layer = torch.bmm(attention_probs, value_layer)
253
+ # change view [b, np, sq, hn]
254
+ context_layer = context_layer.view(*output_size)
255
+ # [b, np, sq, hn] --> [b, sq, np, hn]
256
+ context_layer = context_layer.transpose(1, 2).contiguous()
257
+ # [b, sq, np, hn] --> [b, sq, hp]
258
+ new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
259
+ context_layer = context_layer.reshape(*new_context_layer_shape)
260
+
261
+ return context_layer
262
+
263
+
264
+ class SdpaAttention(CoreAttention):
265
+ def forward(self, query_layer, key_layer, value_layer, attention_mask):
266
+ if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
267
+ context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
268
+ is_causal=True,
269
+ dropout_p=self.config.attention_dropout if self.training else 0.0)
270
  else:
271
+ if attention_mask is not None:
272
+ attention_mask = ~attention_mask
273
+ context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
274
+ attention_mask,
275
+ dropout_p=self.config.attention_dropout if self.training else 0.0)
276
+ context_layer = context_layer.transpose(1, 2).contiguous()
277
+ new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
278
+ context_layer = context_layer.reshape(*new_context_layer_shape)
279
+ return context_layer
280
+
281
+
282
+ def _get_unpad_data(attention_mask):
283
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
284
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
285
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
286
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
287
+ return (
288
+ indices,
289
+ cu_seqlens,
290
+ max_seqlen_in_batch,
291
+ )
292
 
 
 
293
 
294
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2
295
+ class FlashAttention2(CoreAttention):
296
+ def __init__(self, *args, **kwargs):
297
+ super().__init__(*args, **kwargs)
298
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
299
 
300
+ def forward(self, query_states, key_states, value_states, attention_mask):
301
+ query_states = query_states.transpose(1, 2)
302
+ key_states = key_states.transpose(1, 2)
303
+ value_states = value_states.transpose(1, 2)
304
+ batch_size, query_length = query_states.shape[:2]
305
+ if not self._flash_attn_uses_top_left_mask:
306
+ causal = self.is_causal
307
+ else:
308
+ # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
309
+ causal = self.is_causal and query_length != 1
310
+ dropout = self.config.attention_dropout if self.training else 0.0
311
+ # Contains at least one padding token in the sequence
312
+ if attention_mask is not None:
313
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
314
+ query_states, key_states, value_states, attention_mask, query_length
315
  )
316
 
317
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
318
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
319
+
320
+ attn_output_unpad = flash_attn_varlen_func(
321
+ query_states,
322
+ key_states,
323
+ value_states,
324
+ cu_seqlens_q=cu_seqlens_q,
325
+ cu_seqlens_k=cu_seqlens_k,
326
+ max_seqlen_q=max_seqlen_in_batch_q,
327
+ max_seqlen_k=max_seqlen_in_batch_k,
328
+ dropout_p=dropout,
329
+ softmax_scale=None,
330
+ causal=causal,
331
  )
332
 
333
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
334
+ else:
335
+ attn_output = flash_attn_func(
336
+ query_states, key_states, value_states, dropout, softmax_scale=None, causal=causal
337
+ )
338
+ attn_output = attn_output.reshape(batch_size, query_length, self.hidden_size_per_partition).contiguous()
339
+ return attn_output
340
 
341
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
342
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
343
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
344
+
345
+ key_layer = index_first_axis(
346
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
347
+ )
348
+ value_layer = index_first_axis(
349
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
350
+ )
351
+ if query_length == kv_seq_len:
352
+ query_layer = index_first_axis(
353
+ query_layer.reshape(batch_size * kv_seq_len, self.num_attention_heads_per_partition, head_dim), indices_k
354
+ )
355
+ cu_seqlens_q = cu_seqlens_k
356
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
357
+ indices_q = indices_k
358
+ elif query_length == 1:
359
+ max_seqlen_in_batch_q = 1
360
+ cu_seqlens_q = torch.arange(
361
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
362
+ ) # There is a memcpy here, that is very bad.
363
+ indices_q = cu_seqlens_q[:-1]
364
+ query_layer = query_layer.squeeze(1)
365
+ else:
366
+ # The -q_len: slice assumes left padding.
367
+ attention_mask = attention_mask[:, -query_length:]
368
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
369
+
370
+ return (
371
+ query_layer,
372
+ key_layer,
373
+ value_layer,
374
+ indices_q,
375
+ (cu_seqlens_q, cu_seqlens_k),
376
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
377
+ )
378
+
379
+
380
+ CORE_ATTENTION_CLASSES = {
381
+ "eager": CoreAttention,
382
+ "sdpa": SdpaAttention,
383
+ "flash_attention_2": FlashAttention2
384
+ }
385
 
386
 
387
  class SelfAttention(torch.nn.Module):
 
413
  device=device, **_config_to_kwargs(config)
414
  )
415
 
416
+ self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number)
417
 
418
  # Output.
419
  self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
 
492
  value_layer = torch.cat((cache_v, value_layer), dim=2)
493
  if use_cache:
494
  if kv_cache is None:
495
+ kv_cache = torch.cat((key_layer.unsqueeze(0).unsqueeze(0), value_layer.unsqueeze(0).unsqueeze(0)),
496
+ dim=1)
497
  else:
498
  kv_cache = (key_layer, value_layer)
499
  else:
 
759
  config_class = ChatGLMConfig
760
  base_model_prefix = "transformer"
761
  _no_split_modules = ["GLMBlock"]
762
+ _supports_flash_attn_2 = True
763
+ _supports_sdpa = True
764
 
765
  def _init_weights(self, module: nn.Module):
766
  """Initialize the weights."""
767
  return
768
 
769
  def get_masks(self, input_ids, past_key_values, padding_mask=None):
770
+ if self.config._attn_implementation == "flash_attention_2":
771
+ if padding_mask is not None and not padding_mask.all():
772
+ return padding_mask
773
+ return None
774
  batch_size, seq_length = input_ids.shape
775
  full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_ids.device)
776
  full_attention_mask.tril_()
 
845
  config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels
846
  )
847
 
848
+ self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, rope_ratio=config.rope_ratio,
849
+ original_impl=config.original_rope,
850
  device=device, dtype=config.torch_dtype)
851
  self.encoder = init_method(GLMTransformer, config, **init_kwargs)
852
  self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
 
867
  past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
868
  inputs_embeds: Optional[torch.Tensor] = None,
869
  use_cache: Optional[bool] = None,
870
+ output_attentions: Optional[bool] = None,
871
  output_hidden_states: Optional[bool] = None,
872
  return_dict: Optional[bool] = None,
873
  ):
 
1279
  inputs_embeds: Optional[torch.LongTensor] = None,
1280
  labels: Optional[torch.LongTensor] = None,
1281
  use_cache: Optional[bool] = None,
1282
+ output_attentions: Optional[bool] = None,
1283
  output_hidden_states: Optional[bool] = None,
1284
  return_dict: Optional[bool] = None,
1285
  ) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutputWithPast]:
 
1293
  past_key_values=past_key_values,
1294
  inputs_embeds=inputs_embeds,
1295
  use_cache=use_cache,
1296
+ output_attentions=output_attentions,
1297
  output_hidden_states=output_hidden_states,
1298
  return_dict=return_dict,
1299
  )
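Since `SelfAttention` now instantiates its backend from `CORE_ATTENTION_CLASSES[config._attn_implementation]`, a quick sanity check (a sketch; the attribute path is an assumption based on this modeling file) of which class was actually selected:

```python
# Sketch: confirm which attention backend was instantiated for the loaded model.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat", torch_dtype=torch.bfloat16,
    trust_remote_code=True, attn_implementation="sdpa",
)
backend = model.transformer.encoder.layers[0].self_attention.core_attention
print(type(backend).__name__)  # CoreAttention, SdpaAttention, or FlashAttention2
```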
tokenization_chatglm.py CHANGED
@@ -141,98 +141,98 @@ class ChatGLM4Tokenizer(PreTrainedTokenizer):
141
  else:
142
  return str(f"<|{role}|>{metadata}\n{message}")
143
 
144
- def apply_chat_template(
145
- self,
146
- conversation: Union[List[Dict[str, str]], List[List[Dict[str, str]]], "Conversation"],
147
- add_generation_prompt: bool = False,
148
- tokenize: bool = True,
149
- padding: bool = False,
150
- truncation: bool = False,
151
- max_length: Optional[int] = None,
152
- return_tensors: Optional[Union[str, TensorType]] = None,
153
- return_dict: bool = False,
154
- tokenizer_kwargs: Optional[Dict[str, Any]] = None,
155
- add_special_tokens: bool = True,
156
- **kwargs,
157
- ) -> Union[str, List[int], List[str], List[List[int]], BatchEncoding]:
158
-
159
- if return_dict and not tokenize:
160
- raise ValueError(
161
- "`return_dict=True` is incompatible with `tokenize=False`, because there is no dict "
162
- "of tokenizer outputs to return."
163
- )
164
-
165
- def handle_single_conversation(conversation):
166
- input_ids = self.get_prefix_tokens() if add_special_tokens else []
167
- input_message = "[gMASK]<sop>" if add_special_tokens else ""
168
- for item in conversation:
169
- if item.get("tools"):
170
- tools = item["tools"]
171
- content = "你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。"
172
- for tool in tools:
173
- if tool["type"] == "function":
174
- function = tool["function"]
175
- content += f"\n\n## {function['name']}\n\n{json.dumps(function, ensure_ascii=False, indent=4)}"
176
- content += "\n在调用上述函数时,请使用 Json 格式表示调用的参数。"
177
- elif tool["type"] == "python":
178
- content += "\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。"
179
- elif tool["type"] == "simple_browser":
180
- content += "\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。"
181
- elif tool["type"] == "cogview":
182
- content += "\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。"
183
- else:
184
- raise NotImplementedError(f"Unknown tool type {tool['type']}")
185
- input = self.build_single_message("system", "", content, tokenize=tokenize)
186
- if tokenize:
187
- input_ids.extend(input)
188
- else:
189
- input_message += input
190
- if item["content"]:
191
- input = self.build_single_message(
192
- item["role"],
193
- item.get("metadata", ""),
194
- item["content"],
195
- tokenize=tokenize
196
- )
197
- if tokenize:
198
- input_ids.extend(input)
199
- else:
200
- input_message += input
201
- if add_generation_prompt:
202
- if tokenize:
203
- input_ids.extend([self.convert_tokens_to_ids("<|assistant|>")])
204
- else:
205
- input_message += "<|assistant|>"
206
-
207
- return input_ids if tokenize else input_message
208
-
209
- # Main logic to handle different conversation formats
210
- if isinstance(conversation, list) and all(isinstance(i, dict) for i in conversation):
211
- result = handle_single_conversation(conversation)
212
- elif isinstance(conversation, list) and all(isinstance(i, list) for i in conversation):
213
- result = [handle_single_conversation(c) for c in conversation]
214
- elif hasattr(conversation, "messages"):
215
- result = handle_single_conversation(conversation.messages)
216
- else:
217
- raise ValueError("Invalid conversation format")
218
-
219
- if tokenize:
220
- output = self.batch_encode_plus(
221
- [result] if isinstance(result[0], int) else result,
222
- padding=padding,
223
- truncation=truncation,
224
- max_length=max_length,
225
- return_tensors=return_tensors,
226
- is_split_into_words=True,
227
- add_special_tokens=False
228
- )
229
- if return_dict:
230
- return output
231
- else:
232
- return output["input_ids"]
233
- else:
234
- return result
235
-
236
 
237
  def build_inputs_with_special_tokens(
238
  self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
 
141
  else:
142
  return str(f"<|{role}|>{metadata}\n{message}")
143
 
144
+ # Use Jinja Template in tokenizer_config.json
145
+ # def apply_chat_template(
146
+ # self,
147
+ # conversation: Union[List[Dict[str, str]], List[List[Dict[str, str]]], "Conversation"],
148
+ # add_generation_prompt: bool = False,
149
+ # tokenize: bool = True,
150
+ # padding: bool = False,
151
+ # truncation: bool = False,
152
+ # max_length: Optional[int] = None,
153
+ # return_tensors: Optional[Union[str, TensorType]] = None,
154
+ # return_dict: bool = False,
155
+ # tokenizer_kwargs: Optional[Dict[str, Any]] = None,
156
+ # add_special_tokens: bool = True,
157
+ # **kwargs,
158
+ # ) -> Union[str, List[int], List[str], List[List[int]], BatchEncoding]:
159
+ #
160
+ # if return_dict and not tokenize:
161
+ # raise ValueError(
162
+ # "`return_dict=True` is incompatible with `tokenize=False`, because there is no dict "
163
+ # "of tokenizer outputs to return."
164
+ # )
165
+ #
166
+ # def handle_single_conversation(conversation):
167
+ # input_ids = self.get_prefix_tokens() if add_special_tokens else []
168
+ # input_message = "[gMASK]<sop>" if add_special_tokens else ""
169
+ # for item in conversation:
170
+ # if item.get("tools"):
171
+ # tools = item["tools"]
172
+ # content = "你是一个名为 GhatGLM 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。"
173
+ # content += "\n\n# 可用工具"
174
+ # for tool in tools:
175
+ # if tool["type"] == "function":
176
+ # function = tool["function"]
177
+ # content += f"\n\n## {function['name']}\n\n{json.dumps(function, ensure_ascii=False, indent=4)}"
178
+ # content += "\n在调用上述函数时,请使用 Json 格式表示调用的参数。"
179
+ # elif tool["type"] == "python":
180
+ # content += "\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。"
181
+ # elif tool["type"] == "simple_browser":
182
+ # content += "\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。"
183
+ # elif tool["type"] == "cogview":
184
+ # content += "\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。"
185
+ # else:
186
+ # raise NotImplementedError(f"Unknown tool type {tool['type']}")
187
+ # input = self.build_single_message("system", "", content, tokenize=tokenize)
188
+ # if tokenize:
189
+ # input_ids.extend(input)
190
+ # else:
191
+ # input_message += input
192
+ # if item["content"]:
193
+ # input = self.build_single_message(
194
+ # item["role"],
195
+ # item.get("metadata", ""),
196
+ # item["content"],
197
+ # tokenize=tokenize
198
+ # )
199
+ # if tokenize:
200
+ # input_ids.extend(input)
201
+ # else:
202
+ # input_message += input
203
+ # if add_generation_prompt:
204
+ # if tokenize:
205
+ # input_ids.extend([self.convert_tokens_to_ids("<|assistant|>")])
206
+ # else:
207
+ # input_message += "<|assistant|>"
208
+ # return input_ids if tokenize else input_message
209
+ #
210
+ # # Main logic to handle different conversation formats
211
+ # if isinstance(conversation, list) and all(isinstance(i, dict) for i in conversation):
212
+ # result = handle_single_conversation(conversation)
213
+ # elif isinstance(conversation, list) and all(isinstance(i, list) for i in conversation):
214
+ # result = [handle_single_conversation(c) for c in conversation]
215
+ # elif hasattr(conversation, "messages"):
216
+ # result = handle_single_conversation(conversation.messages)
217
+ # else:
218
+ # raise ValueError("Invalid conversation format")
219
+ #
220
+ # if tokenize:
221
+ # output = self.batch_encode_plus(
222
+ # [result] if isinstance(result[0], int) else result,
223
+ # padding=padding,
224
+ # truncation=truncation,
225
+ # max_length=max_length,
226
+ # return_tensors=return_tensors,
227
+ # is_split_into_words=True,
228
+ # add_special_tokens=False
229
+ # )
230
+ # if return_dict:
231
+ # return output
232
+ # else:
233
+ # return output["input_ids"]
234
+ # else:
235
+ # return result
236
 
237
  def build_inputs_with_special_tokens(
238
  self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
tokenizer_config.json CHANGED
@@ -123,6 +123,7 @@
123
  "<|user|>", "<|assistant|>", "<|observation|>", "<|begin_of_image|>", "<|end_of_image|>",
124
  "<|begin_of_video|>", "<|end_of_video|>"],
125
  "clean_up_tokenization_spaces": false,
 
126
  "do_lower_case": false,
127
  "eos_token": "<|endoftext|>",
128
  "pad_token": "<|endoftext|>",
 
123
  "<|user|>", "<|assistant|>", "<|observation|>", "<|begin_of_image|>", "<|end_of_image|>",
124
  "<|begin_of_video|>", "<|end_of_video|>"],
125
  "clean_up_tokenization_spaces": false,
126
+ "chat_template": "[gMASK]<sop>{% for item in messages %}{% if item['tools'] is defined %}<|system|>\n你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。\n\n# 可用工具{% set tools = item['tools'] %}{% for tool in tools %}{% if tool['type'] == 'function' %}\n\n## {{ tool['function']['name'] }}\n\n{{ tool['function'] | tojson(indent=4) }}\n在调用上述函数时,请使用 Json 格式表示调用的参数。{% elif tool['type'] == 'python' %}\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。{% elif tool['type'] == 'simple_browser' %}\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。{% elif tool['type'] == 'cogview' %}\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。{% endif %}{% endfor %}{% endif %}{% if item['content'] %}<|{{ item['role'] }}|>{{ item['metadata'] }}\n{{ item['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}",
127
  "do_lower_case": false,
128
  "eos_token": "<|endoftext|>",
129
  "pad_token": "<|endoftext|>",
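With `chat_template` now shipped in tokenizer_config.json (replacing the Python `apply_chat_template` override that this commit comments out in tokenization_chatglm.py), prompts are built through the standard transformers API. A minimal usage sketch (illustration only, not part of the commit):

```python
# Sketch: the Jinja chat_template above is applied by the stock transformers method.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

messages = [{"role": "user", "content": "你好"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # appends <|assistant|> per the template
    tokenize=True,
    return_tensors="pt",
)
print(tokenizer.decode(input_ids[0]))  # starts with "[gMASK]<sop>" followed by <|user|>
```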