First model version

Files changed (12)

README.md +75 -0
added_tokens.json +40 -0
config.json +44 -0
generation_config.json +6 -0
merges.txt +0 -0
pytorch_model-00001-of-00002.bin +3 -0
pytorch_model-00002-of-00002.bin +3 -0
pytorch_model.bin.index.json +779 -0
special_tokens_map.json +5 -0
tokenizer.json +0 -0
tokenizer_config.json +9 -0
vocab.json +0 -0

README.md CHANGED

@@ -1,3 +1,78 @@
 ---
 license: apache-2.0
 ---

+---
+license: apache-2.0
+---
+# **MoLM**
+MoLM is a collection of MoE-based language models ranging in scale from 4 billion to 8 billion parameters. This is the repository for the 8B pretrained model, converted to the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.
+**Model Usage**
+To load the model, you need to install the [ModuleFormer package](https://github.com/IBM/ModuleFormer). Then you can load the model with the following code:
+```
+from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AutoModelForSequenceClassification
+from moduleformer import ModuleFormerForCausalLM, ModuleFormerConfig, ModuleFormerForSequenceClassification
+# Register the custom architecture so the Auto* classes can resolve "moduleformer".
+AutoConfig.register("moduleformer", ModuleFormerConfig)
+AutoModelForCausalLM.register(ModuleFormerConfig, ModuleFormerForCausalLM)
+AutoModelForSequenceClassification.register(ModuleFormerConfig, ModuleFormerForSequenceClassification)
+# Load the tokenizer and weights for this repository (the 8B model).
+tokenizer = AutoTokenizer.from_pretrained('ibm/MoLM-700M-8B')
+model = AutoModelForCausalLM.from_pretrained('ibm/MoLM-700M-8B')
+```
+**Model Details**
+MoLM-350M-4B is an MoE-based language model. It has 4 billion parameters, but each input token uses only 350M parameters during inference. Thus, it is computationally equivalent to a 350M dense model.
+MoLM-700M-8B has 8 billion parameters and is computationally equivalent to a 700M dense model.
+Both models are trained on 300 billion tokens from publicly available sources, with a learning rate of 3.0 x 10<sup>-4</sup> and a global batch size of 3M tokens.
+**Model Developers** IBM
+**Variations** MoLM comes in two different parameter sizes — 4B and 8B.
+**Input** Models input text only.
+**Output** Models generate text only.
+**Model Architecture** MoLM is an auto-regressive language model that uses the ModuleFormer architecture. It has 16 attention modules in each attention layer and 32 MLP modules in each MLP layer. During inference, the model activates 2 modules in each layer for each token, conditioned on the input. MoLM-350M-4B has 24 blocks and MoLM-700M-8B has 48 blocks.
+**Status** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.
+**Research Paper** ["ModuleFormer: Modularity Emerges from Mixture-of-Experts"](https://arxiv.org/abs/2306.04640)
+## Training Data
+MoLM was pretrained on 300 billion tokens of data from publicly available sources.
+## Evaluation Results
+In this section, we report the results for the MoLM-350M-4B and MoLM-700M-8B models on standard academic benchmarks. For all the evaluations, we use our internal evaluation library.
+|Model|Latency|Memory|Throughput|Hellaswag|PIQA|ARC-e|ARC-c|OBQA|
+|---|---|---|---|---|---|---|---|---|
+||ms|GB|tokens/sec|acc|acc|acc|acc|acc|
+|Pythia 410M|554|25|59594|33.72|66.70|51.89|21.42|18.2|
+|GPT-Neo 1.3B|991|23|32857|38.66|71.11|56.19|23.12|21.4|
+|Pythia 1.4B|918|42|35559|40.41|70.84|60.52|26.11|22.2|
+|MoLM-350M-4B|497|27|71017|39.21|70.13|56.44|23.55|20.8|
+|GPT-Neo 2.7B|1737|35|18788|42.71|72.2|61.07|27.47|23.2|
+|Pythia 2.8B|2111|70|15522|45.34|73.99|64.35|29.35|23.8|
+|MoLM-700M-8B|939|38|37419|43.33|72.91|62.46|27.90|23.8|
+|Model| |TriviaQA| | | HumanEval| |Wikitext|
+|---|---|---|---|---|---|---|---|
+||0-shot |1-shot |5-shot |pass@1 |pass@10 |pass@100 |PPL|
+|Pythia 410M |2.32 |5.02 |6.42 |1.20 |3.85 |9.98 |20.09 |
+|GPT-Neo 1.3B |5.24 |8.01 |9.74 |3.62 |6.87 |14.50 |16.16 |
+|Pythia 1.4B |5.30 |9.87 |12.84 |2.19 |7.31 |14.33 |14.71|
+|MoLM-350M-4B |5.40 |11.12 |13.70 |3.04 |6.99 |13.79 |15.15 |
+|GPT-Neo 2.7B |4.82 |11.23 |13.67 |4.89 |9.54 |17.90 |13.93 |
+|Pythia 2.8B |7.38 |15.58 |18.98 |4.91 |11.76 |21.54 |12.68|
+|MoLM-700M-8B |11.47 |16.73 |20.75 |5.51 |12.58 |20.40 |12.97 |
+## Ethical Considerations and Limitations
+MoLM is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, MoLM’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of MoLM, developers should perform safety testing and tuning tailored to their specific applications of the model.
+## MoLM Model Index
+|Model|MoLM|
+|---|---|
+|350M-4B| [Link](https://huggingface.co/ibm/MoLM-350M-4B) |
+|700M-8B| [Link](https://huggingface.co/ibm/MoLM-700M-8B) |
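
The Model Architecture paragraph of the README above says each token activates 2 of the 16 attention modules and 2 of the 32 MLP modules in a layer. As a rough illustration of what that kind of sparse selection looks like, here is a minimal sketch of generic top-k gating over expert scores. It is not the ModuleFormer implementation: the helper name `top_k_route`, the random gate scores, and the softmax renormalisation over the selected experts are all assumptions made for illustration.

```python
import math
import random

def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights.

    Generic top-k gating sketch (hypothetical helper, not ModuleFormer's code).
    """
    # Indices of the k largest scores, highest first.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, shifted by the max for stability.
    top = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# One token, 32 MLP experts, 2 active (counts taken from the README above).
rng = random.Random(0)
gate_scores = [rng.gauss(0.0, 1.0) for _ in range(32)]
routing = top_k_route(gate_scores, k=2)
for expert_id, weight in routing:
    print(f"expert {expert_id}: weight {weight:.3f}")
```

Only the two selected experts would run for this token, which is why the per-token compute stays close to that of a much smaller dense model.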

added_tokens.json ADDED

	@@ -0,0 +1,40 @@

+{
+  "\t\t": 50294,
+  "\t\t\t": 50293,
+  "\t\t\t\t": 50292,
+  "\t\t\t\t\t": 50291,
+  "\t\t\t\t\t\t": 50290,
+  "\t\t\t\t\t\t\t": 50289,
+  "\t\t\t\t\t\t\t\t": 50288,
+  "\t\t\t\t\t\t\t\t\t": 50287,
+  "  ": 50286,
+  "   ": 50285,
+  "    ": 50284,
+  "     ": 50283,
+  "      ": 50282,
+  "       ": 50281,
+  "        ": 50280,
+  "         ": 50279,
+  "          ": 50278,
+  "           ": 50277,
+  "            ": 50276,
+  "             ": 50275,
+  "              ": 50274,
+  "               ": 50273,
+  "                ": 50272,
+  "                 ": 50271,
+  "                  ": 50270,
+  "                   ": 50269,
+  "                    ": 50268,
+  "                     ": 50267,
+  "                      ": 50266,
+  "                       ": 50265,
+  "                        ": 50264,
+  "                         ": 50263,
+  "                          ": 50262,
+  "                           ": 50261,
+  "                            ": 50260,
+  "                             ": 50259,
+  "                              ": 50258,
+  "                               ": 50257
+}
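
The IDs in this file follow a regular pattern: runs of 2 to 9 tabs take IDs 50294 down to 50287, and runs of 2 to 31 spaces take IDs 50286 down to 50257. A minimal sketch that rebuilds the table from that pattern can serve as a sanity check; the helper name `build_whitespace_tokens` is hypothetical and not part of any library.

```python
def build_whitespace_tokens():
    """Rebuild the whitespace-token table from the pattern seen in added_tokens.json."""
    tokens = {}
    # Runs of 2..9 tabs: 2 tabs -> 50294, 9 tabs -> 50287.
    for n in range(2, 10):
        tokens["\t" * n] = 50296 - n
    # Runs of 2..31 spaces: 2 spaces -> 50286, 31 spaces -> 50257.
    for n in range(2, 32):
        tokens[" " * n] = 50288 - n
    return tokens

added = build_whitespace_tokens()
print(len(added), min(added.values()), max(added.values()))
```

The 38 entries occupy IDs 50257 through 50294 without gaps, which is consistent with the `vocab_size` of 50295 in config.json.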

config.json ADDED

	@@ -0,0 +1,44 @@

+{
+  "_name_or_path": "./MoLM-700M-8B",
+  "activation_function": "gelu_new",
+  "architectures": [
+    "ModuleFormerForCausalLM"
+  ],
+  "att_func": "stickbreaking",
+  "att_hidden": 1024,
+  "attn_pdrop": 0,
+  "aux_loss_type": "mi",
+  "aux_loss_weight": 0,
+  "block_size": 512,
+  "bos_token_id": 50256,
+  "embd_pdrop": 0,
+  "eos_token_id": 50256,
+  "ffd_hidden": 2048,
+  "gate_type": "mlp",
+  "gating_size": 256,
+  "history_length": 512,
+  "initializer_range": 0.02,
+  "k_att": 2,
+  "k_mlp": 2,
+  "layer_norm_epsilon": 1e-05,
+  "local_size": 1,
+  "model_type": "moduleformer",
+  "moe_pdrop": 0,
+  "moe_type": "moe",
+  "n_att_experts": 16,
+  "n_ctx": 24576,
+  "n_embd": 1024,
+  "n_head": 1,
+  "n_layer": 48,
+  "n_mlp_experts": 32,
+  "pre_norm": true,
+  "resid_pdrop": 0,
+  "sample_topk": 0,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.28.1",
+  "universal": false,
+  "use_cache": true,
+  "vocab_size": 50295,
+  "world_size": null
+}
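
Several of these values can be cross-checked against the README in this commit, which states 16 attention modules and 32 MLP modules per layer, 2 active modules per token, and 48 blocks for MoLM-700M-8B. The sketch below copies only those fields into a dictionary (an excerpt, not the full config) and computes the fraction of experts active per token; the variable names are illustrative.

```python
# Excerpt of config.json from this commit (only the fields used here).
config = {
    "n_layer": 48,
    "n_att_experts": 16,
    "k_att": 2,
    "n_mlp_experts": 32,
    "k_mlp": 2,
    "vocab_size": 50295,
}

# Fraction of experts each token activates in one layer.
att_fraction = config["k_att"] / config["n_att_experts"]
mlp_fraction = config["k_mlp"] / config["n_mlp_experts"]

# README figures: about 700M of 8B parameters are used per token.
readme_fraction = 700e6 / 8e9

print(f"attention experts active: {att_fraction:.4f}")
print(f"MLP experts active:       {mlp_fraction:.4f}")
print(f"README active parameters: {readme_fraction:.4f}")
```

The README ratio falls between the two per-layer expert fractions. This is only a rough consistency check, not a parameter count.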

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 50256,
+  "eos_token_id": 50256,
+  "transformers_version": "4.28.1"
+}

merges.txt ADDED

The diff for this file is too large to render. See raw diff

pytorch_model-00001-of-00002.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:52477624553ada43d5b61d562756ec8cca326714ca1ff0fab751a4192c3e65b4
+size 9993270641

pytorch_model-00002-of-00002.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dd2504c16d078188953043c0486374925e2c10faae64de0b3e9a621e3faaef42
+size 6522083237
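
The two `.bin` entries above are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object ID, and the size in bytes. A minimal sketch for parsing such a pointer and checking a downloaded file against it follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is copied from the first shard in this commit.

```python
import hashlib

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}

def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and SHA-256."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte shards do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]

pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:52477624553ada43d5b61d562756ec8cca326714ca1ff0fab751a4192c3e65b4
size 9993270641
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model-00001-of-00002.bin", pointer)` on a downloaded shard would then indicate whether the download is intact.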

pytorch_model.bin.index.json ADDED

	@@ -0,0 +1,779 @@

+{
+  "metadata": {
+    "total_size": 16515088384
+  },
+  "weight_map": {
+    "lm_head.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.0.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.0.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.1.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.10.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.11.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.12.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.13.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.14.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.15.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.16.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.17.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.18.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.19.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.2.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.20.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.21.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.22.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.23.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.24.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.25.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.26.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.27.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.28.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.29.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.29.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.29.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.29.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.29.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.29.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.29.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.3.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.3.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.30.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.30.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.31.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.32.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.33.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.34.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.35.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.36.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.37.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.38.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.39.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.4.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.4.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.40.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.40.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.41.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.42.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.43.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.44.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.45.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.46.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.cum_weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.mask": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.q_proj.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.attn.q_proj.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.ln_1.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.ln_1.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.ln_2.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.ln_2.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.mlpf.gate.w_gate.0.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.mlpf.gate.w_gate.0.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.mlpf.gate.w_gate.3.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.47.mlpf.output_experts.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.h.5.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.5.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.6.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.7.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.8.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.cum_weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.mask": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.q_proj.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.q_proj.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.q_proj.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.q_proj.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.attn.q_proj.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.ln_1.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.ln_1.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.ln_2.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.ln_2.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.mlpf.gate.w_gate.0.bias": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.mlpf.gate.w_gate.0.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.mlpf.gate.w_gate.3.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.h.9.mlpf.output_experts.weight": "pytorch_model-00001-of-00002.bin",
+    "transformer.ln_f.bias": "pytorch_model-00002-of-00002.bin",
+    "transformer.ln_f.weight": "pytorch_model-00002-of-00002.bin",
+    "transformer.wte.weight": "pytorch_model-00001-of-00002.bin"
+  }
+}
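
The index above maps each parameter name to the shard file that stores it. As a minimal sketch of how such a map can be inspected, the snippet below groups parameter names by shard. It assumes the map sits under a top-level `weight_map` key (the usual layout for sharded Transformers checkpoints, though the key is not visible in this part of the diff). The `group_by_shard` helper is hypothetical, and the four entries are a small sample copied from the diff rather than the full index.

```python
from collections import defaultdict

# Small sample copied from the diff above; the real index holds many more entries.
# Assumption: the mapping lives under a top-level "weight_map" key.
index = {
    "weight_map": {
        "transformer.h.5.mlpf.experts.weight": "pytorch_model-00001-of-00002.bin",
        "transformer.h.47.mlpf.experts.weight": "pytorch_model-00002-of-00002.bin",
        "transformer.ln_f.weight": "pytorch_model-00002-of-00002.bin",
        "transformer.wte.weight": "pytorch_model-00001-of-00002.bin",
    }
}


def group_by_shard(weight_map):
    """Return {shard file name: sorted list of parameter names stored in it}."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return {shard: sorted(names) for shard, names in shards.items()}


shards = group_by_shard(index["weight_map"])
for shard in sorted(shards):
    # Print each shard with the number of sampled tensors it holds.
    print(shard, len(shards[shard]))
```

To run this against the actual file, replace the inline dictionary with `json.load` on a local copy of `pytorch_model.bin.index.json`.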

special_tokens_map.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "add_prefix_space": false,
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 2048,
+  "tokenizer_class": "CodeGenTokenizer",
+  "unk_token": "<|endoftext|>"
+}
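
The tokenizer configuration above uses the same `<|endoftext|>` string for the BOS, EOS, and unknown tokens and sets `model_max_length` to 2048. The sketch below parses a copy of that configuration, confirms that the three special tokens coincide, and illustrates what the length limit implies for a list of token ids. The `truncate` helper is hypothetical and stands in for the truncation a tokenizer would normally perform; it is not part of any library.

```python
import json

# Copy of the tokenizer_config.json content shown in the diff above.
tokenizer_config = json.loads("""
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 2048,
  "tokenizer_class": "CodeGenTokenizer",
  "unk_token": "<|endoftext|>"
}
""")

# Collect the distinct strings used for the three special-token roles.
special_tokens = {
    tokenizer_config[key] for key in ("bos_token", "eos_token", "unk_token")
}


def truncate(token_ids, max_length=tokenizer_config["model_max_length"]):
    """Hypothetical helper: keep at most max_length ids from the start."""
    return token_ids[:max_length]


print(special_tokens)
print(len(truncate(list(range(3000)))))
```

Because one string serves all three roles, code that needs a distinct padding or unknown token would have to add one explicitly.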

vocab.json ADDED

The diff for this file is too large to render. See raw diff