A model trained on 300k CSharp-related instructions and further tuned on 50k specific, short, GPT-generated ones.

## Training
Fine-tuned from [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct) for CSharp. The maximum context length is restricted to 1024 tokens.

- Loss: 0.256045166015625 on 300k CSharp-related records
- Loss: 0.095714599609375 on 50k specific short prompts
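
As a rough sanity check, the two loss values can be converted to perplexity. This is only a sketch: it assumes the numbers are mean per-token cross-entropy in nats, which the card does not state, and it says nothing about which split they were measured on.

```python
import math

# Loss values copied from the card. Treating them as mean per-token
# cross-entropy in nats is an assumption, not something the card states.
reported_losses = {
    "300k CSharp-related records": 0.256045166015625,
    "50k specific short prompts": 0.095714599609375,
}

def perplexity(loss_nats: float) -> float:
    """Perplexity equals exp(mean cross-entropy) when the loss is in nats."""
    return math.exp(loss_nats)

for name, loss in reported_losses.items():
    print(f"{name}: loss={loss:.4f}, perplexity={perplexity(loss):.4f}")
```

Under that assumption, both stages end with a perplexity only slightly above 1 on their own data, which is expected for a narrow fine-tune and is not a measure of held-out quality.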

## Sources
The training data contained the following (most records were around 500 tokens long and under 1000, except for large code files):
- codeparrot/github-code C# (licenses: "mit", "Apache-2.0", "Bsd-3-clause", "Bsd-2-clause", "Cc0-1.0", "Unlicense", "isc")
- Raw data: plain .cs files randomly cut at 60-80% of their length for the instruction; the network is asked to continue with the remaining 40-20% (76k)
- Documented static functions (72k)
- SO 5q_5answer + 5q_5best (CC BY-SA 4.0) (70k)
- Dotnet wiki (30k, rendered from the [GitHub repo](https://github.com/microsoft/dotnet), "See also" sections removed, with a GPT-4 generated short question for each file)
- All NM static functions and tests (from the [Nethermind client repo](https://github.com/NethermindEth/nethermind)), documented and described via GPT-4 (4k)
- GPT-4 questions, GPT-3.5 answers for CSharp: Short Q -> Code, Explain Code X -> Step-By-Step (35k)
- GPT-4 questions, GPT-3.5 answers for the Nethermind client interface `IEthRpcModule`: Short Q -> Code, Explain Code X -> Step-By-Step (7k)
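
The raw-data item above describes cutting each .cs file somewhere between 60% and 80% and asking the model to continue. A minimal sketch of that idea follows. The function name `split_for_continuation` is hypothetical, and cutting on line boundaries is an assumption: the card does not say whether the cut was made by characters, tokens, or lines.

```python
import random
from typing import Tuple

def split_for_continuation(source: str, rng: random.Random,
                           lo: float = 0.6, hi: float = 0.8) -> Tuple[str, str]:
    """Cut `source` at a random line boundary between `lo` and `hi` of its length.

    Returns (prompt, completion); concatenating them restores the input.
    Line-based cutting is an assumption made for this sketch.
    """
    lines = source.splitlines(keepends=True)
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    cut = int(len(lines) * rng.uniform(lo, hi))
    cut = max(1, min(cut, len(lines) - 1))  # keep both halves non-empty
    return "".join(lines[:cut]), "".join(lines[cut:])

# Toy stand-in for a C# file, used only for illustration.
sample = "".join(f"int x{i} = {i};\n" for i in range(100))
prompt, completion = split_for_continuation(sample, random.Random(0))
print(len(prompt.splitlines()), len(completion.splitlines()))
```

The first part would go into the instruction and the second part would be the target the network learns to produce.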

## Contents
- HF compatible model
- GGML compatible quantisations (f16, q8, q5)

Files changed (8)

config.json +52 -0
ggml-model-f16.bin +3 -0
ggml-model-q5_0.bin +3 -0
ggml-model-q8_0.bin +3 -0
pytorch_model.bin +3 -0
special_tokens_map.json +6 -0
tokenizer.json +0 -0
tokenizer_config.json +9 -0

config.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "architectures": [
+    "MPTForCausalLM"
+  ],
+  "attn_config": {
+    "alibi": true,
+    "alibi_bias_max": 8,
+    "attn_impl": "torch",
+    "attn_pdrop": 0,
+    "attn_type": "multihead_attention",
+    "attn_uses_sequence_id": false,
+    "clip_qkv": null,
+    "prefix_lm": false,
+    "qk_ln": false,
+    "softmax_scale": null
+  },
+  "auto_map": {
+    "AutoConfig": "mosaicml/mpt-7b-instruct--configuration_mpt.MPTConfig",
+    "AutoModelForCausalLM": "mosaicml/mpt-7b-instruct--modeling_mpt.MPTForCausalLM"
+  },
+  "d_model": 4096,
+  "emb_pdrop": 0,
+  "embedding_fraction": 1.0,
+  "expansion_ratio": 4,
+  "init_config": {
+    "emb_init_std": null,
+    "emb_init_uniform_lim": null,
+    "fan_mode": "fan_in",
+    "init_div_is_residual": true,
+    "init_gain": 0,
+    "init_nonlinearity": "relu",
+    "init_std": 0.02,
+    "name": "kaiming_normal_",
+    "verbose": 0
+  },
+  "init_device": "cuda:0",
+  "learned_pos_emb": true,
+  "logit_scale": null,
+  "max_seq_len": 1024,
+  "model_type": "mpt",
+  "n_heads": 32,
+  "n_layers": 32,
+  "no_bias": true,
+  "norm_type": "low_precision_layernorm",
+  "resid_pdrop": 0,
+  "tokenizer_name": "EleutherAI/gpt-neox-20b",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.31.0",
+  "use_cache": false,
+  "verbose": 0,
+  "vocab_size": 50432
+}
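
The shape fields in this config are enough for a back-of-the-envelope parameter count. The sketch below assumes the usual MPT layout: a fused QKV projection plus an output projection, a two-layer MLP widened by `expansion_ratio`, weight-only norms because `no_bias` is true, an output head tied to the token embedding, and no learned position table because `alibi` is true. Those layout details are assumptions about the architecture and are not stated in the config itself.

```python
# Shape values copied from config.json above.
config = {"d_model": 4096, "n_layers": 32, "expansion_ratio": 4, "vocab_size": 50432}

def estimate_mpt_params(cfg: dict) -> int:
    """Rough parameter count under the layout assumptions named in the text."""
    d = cfg["d_model"]
    embedding = cfg["vocab_size"] * d          # assumed tied with the output head
    attention = 3 * d * d + d * d              # fused QKV + output projection
    mlp = 2 * cfg["expansion_ratio"] * d * d   # up- and down-projection
    norms = 2 * d                              # two weight-only norms per block
    final_norm = d
    return embedding + cfg["n_layers"] * (attention + mlp + norms) + final_norm

n_params = estimate_mpt_params(config)
print(f"{n_params:,} parameters, about {n_params * 2:,} bytes in bfloat16")
```

The estimate comes to roughly 6.65 billion parameters, or about 13.3 GB at two bytes each, which is within 0.001% of the `pytorch_model.bin` size listed in this commit.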

ggml-model-f16.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4102aa709b6983a0ab92f1e21538c16003ac1eed0942ee571573328d1cb79585
+size 13299639642

ggml-model-q5_0.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:24d040eff4d2532d6b5283e2a02b7ed86d8f5e46cee387b06a5dbc666794bf5c
+size 4572800346

ggml-model-q8_0.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:de3a21a45e14f71d9fcdb4d65b403831499bdd28ba0c5f1c808015602b2c23f5
+size 7066183002
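
The three GGML file sizes can be cross-checked against the block layouts of the quantisation formats. The sketch assumes about 6.65 billion weights (an estimate derived from `d_model`, `n_layers`, `expansion_ratio`, and `vocab_size` in config.json, not a figure the card states). It also assumes every weight is stored in the named format, which slightly underestimates the file size because headers, vocabulary, and any tensors kept at higher precision are ignored.

```python
BLOCK_SIZE = 32  # weights per quantisation block

# Bytes per block of 32 weights:
#   f16  : 32 two-byte values
#   q8_0 : one fp16 scale + 32 int8 values
#   q5_0 : one fp16 scale + 4 bytes of high bits + 16 bytes of 4-bit low parts
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22}

# Sizes copied from the git-lfs pointers in this commit.
listed_sizes = {"f16": 13299639642, "q8_0": 7066183002, "q5_0": 4572800346}

N_PARAMS = 6_649_286_656  # assumed parameter count, see the text above

def estimate_ggml_bytes(n_params: int, fmt: str) -> int:
    """Payload size if every weight used the given format."""
    return n_params * BLOCK_BYTES[fmt] // BLOCK_SIZE

for fmt, listed in listed_sizes.items():
    est = estimate_ggml_bytes(N_PARAMS, fmt)
    print(f"{fmt}: estimated {est:,} bytes, listed {listed:,} bytes, "
          f"ratio {est / listed:.4f}")
```

Under these assumptions the q5_0 and q8_0 files work out to 5.5 and 8.5 bits per weight respectively, and each estimate lands within 0.1% of the listed size.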

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f4359b2439625f2a54998a9fc93cf16fbc4b94629cfdc50fe263f68280b3ddb9
+size 13298660777

special_tokens_map.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "add_prefix_space": false,
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "tokenizer_class": "GPTNeoXTokenizer",
+  "unk_token": "<|endoftext|>"
+}
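
Note that `model_max_length` here equals `int(1e30)`, which to my understanding is the placeholder the tokenizer library writes when no limit is configured, while config.json sets `max_seq_len` to 1024. The tokenizer settings alone therefore would not truncate prompts to the model's context window. A minimal sketch of reconciling the two values follows; the helper name `effective_max_length` is hypothetical.

```python
# Placeholder meaning "no limit set"; treating it as such is an assumption.
NO_LIMIT_SENTINEL = int(1e30)

# Relevant fields copied from the two JSON files in this commit.
tokenizer_cfg = {"model_max_length": 1000000000000000019884624838656}
model_cfg = {"max_seq_len": 1024}

def effective_max_length(tok_cfg: dict, mdl_cfg: dict) -> int:
    """Return the smaller of the tokenizer limit and the model context length."""
    tok_max = tok_cfg.get("model_max_length", NO_LIMIT_SENTINEL)
    return min(tok_max, mdl_cfg["max_seq_len"])

limit = effective_max_length(tokenizer_cfg, model_cfg)
print(f"truncate prompt + generation budget to {limit} tokens")
```

In practice this means passing an explicit maximum length when tokenizing, rather than relying on the tokenizer's default.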