Weird issue with Llama.cpp and Q4_K_M model
Hello everyone,
I'm pretty new to llama.cpp and have really only been using it through other tools, so my question might be naive.
I'm trying to load a model in an app built with Gradio (https://www.gradio.app/), so that I can put together quick demos to share with team members.
Under the hood, the app uses llama.cpp (via the llama-cpp-python bindings) to load GGUF models for inference.
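To give an idea of the setup, the app is essentially a thin Gradio wrapper around llama-cpp-python, along these lines (a minimal sketch with placeholder paths and parameters, not my exact code):

```python
# Minimal sketch of the setup (placeholder path/parameters).
import gradio as gr
from llama_cpp import Llama

# Load a local GGUF model through the llama-cpp-python bindings.
llm = Llama(
    model_path="models/my-model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload as many layers as possible to the GPU
)

def generate(prompt: str) -> str:
    # Plain completion call; the real app uses a task-specific prompt template.
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]

# Simple text-in/text-out interface that can be shared with team members.
gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```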
I'm currently looking for models that allow me to do some specialized NER tasks.
Not being quite happy with what I found, I decided to use a tool called Ludwig to fine-tune models and specialize them for a specific extraction task.
So I took several models, fine-tuned them to produce LoRA adapter files (.safetensors), converted those to GGML, and then exported them to GGUF.
Then I use Gradio to host the resulting models and validate their quality.
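For context, the general shape of the adapter step is something like the following minimal sketch (using the peft library, with placeholder names; my actual pipeline goes through Ludwig's outputs and llama.cpp's conversion scripts for the GGUF part):

```python
# Minimal sketch (placeholder names): fold a LoRA adapter back into its
# base model so the result can be converted to GGUF as a single model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("some-base-model")   # placeholder
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # .safetensors adapter
merged = model.merge_and_unload()  # apply the LoRA deltas to the base weights
merged.save_pretrained("merged-model/")
# From here, llama.cpp's conversion script plus quantization produce the .gguf file.
```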
This worked for most of the models I tested: Llama 2, Llama 3, Mixtral 8x7B, and Phi-3.
Most of the time I use quantized versions of those models because I'm quite GPU-limited.
But when I tested an Open_Gpt4 model (https://huggingface.co/TheBloke/Open_Gpt4_8x7B_v0.2-GGUF), quantized to 4 bits (open_gpt4_8x7b_v0.2.Q4_K_M.gguf), I hit a very weird issue:
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from /workspace/.cache/huggingface/hub/models--XXXXX--Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF/snapshots/56e5dfed05fdd997713eec5e12ea5fbb28dd337d/./open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf (version GGUF V3 (latest))
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 0: general.architecture str = llama
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 1: general.name str = rombodawg_open_gpt4_8x7b_v0.2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 2: llama.context_length u32 = 32768
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 4: llama.block_count u32 = 32
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 10: llama.expert_count u32 = 8
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 11: llama.expert_used_count u32 = 2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 13: general.file_type u32 = 15
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - kv 24: general.quantization_version u32 = 2
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type f32: 65 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type f16: 32 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type q8_0: 64 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type q4_K: 705 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_loader: - type q6_K: 129 tensors
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_vocab: special tokens cache size = 259
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_vocab: token to piece cache size = 0.1637 MB
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: format = GGUF V3 (latest)
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: arch = llama
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: vocab type = SPM
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_vocab = 32000
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_merges = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_ctx_train = 32768
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd = 4096
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_head = 32
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_head_kv = 8
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_layer = 32
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_rot = 128
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_swa = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_head_k = 128
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_head_v = 128
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_gqa = 4
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_k_gqa = 1024
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_embd_v_gqa = 1024
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_norm_eps = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_norm_rms_eps = 1.0e-05
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: f_logit_scale = 0.0e+00
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_ff = 14336
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_expert = 8
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_expert_used = 2
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: causal attn = 1
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: pooling type = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: rope type = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: rope scaling = linear
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: freq_base_train = 1000000.0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: freq_scale_train = 1
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: rope_finetuned = unknown
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_d_conv = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_d_inner = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_d_state = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: ssm_dt_rank = 0
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model type = 8x7B
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model ftype = Q4_K - Medium
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model params = 46.70 B
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: model size = 26.43 GiB (4.86 BPW)
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: general.name = rombodawg_open_gpt4_8x7b_v0.2
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: BOS token = 1 '<s>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: EOS token = 2 '</s>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: UNK token = 0 '<unk>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: PAD token = 0 '<unk>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: LF token = 13 '<0x0A>'
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_print_meta: max token length = 48
2024-09-03T13:47:33Z [app] [l4jdt] llm_load_tensors: ggml ctx size = 0.38 MiB
2024-09-03T13:47:33Z [app] [l4jdt] llama_model_load: error loading model: create_tensor_as_view: tensor 'blk.4.ffn_down.2.weight' has wrong type; expected q4_K, got q6_K
2024-09-03T13:47:33Z [app] [l4jdt] llama_load_model_from_file: failed to load model
2024-09-03T13:47:33Z [app] [l4jdt] Traceback (most recent call last):
2024-09-03T13:47:33Z [app] [l4jdt] File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-09-03T13:47:33Z [app] [l4jdt] return _run_code(code, main_globals, None,
2024-09-03T13:47:33Z [app] [l4jdt] File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-09-03T13:47:33Z [app] [l4jdt] exec(code, run_globals)
2024-09-03T13:47:33Z [app] [l4jdt] File "/workspace/pdf2stix_demo/app.py", line 42, in <module>
2024-09-03T13:47:33Z [app] [l4jdt] generator = generator_cls(
2024-09-03T13:47:33Z [app] [l4jdt] File "<string>", line 9, in __init__
2024-09-03T13:47:33Z [app] [l4jdt] File "/workspace/pdf2stix_demo/json_generator.py", line 129, in __post_init__
2024-09-03T13:47:33Z [app] [l4jdt] self.llama = Llama.from_pretrained(
2024-09-03T13:47:33Z [app] [l4jdt] File "/workspace/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 2091, in from_pretrained
2024-09-03T13:47:33Z [app] [l4jdt] return cls(
2024-09-03T13:47:33Z [app] [l4jdt] File "/workspace/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 358, in __init__
2024-09-03T13:47:33Z [app] [l4jdt] self._model = self._stack.enter_context(contextlib.closing(_LlamaModel(
2024-09-03T13:47:33Z [app] [l4jdt] File "/workspace/.venv/lib/python3.10/site-packages/llama_cpp/_internals.py", line 54, in __init__
2024-09-03T13:47:33Z [app] [l4jdt] raise ValueError(f"Failed to load model from file: {path_model}")
2024-09-03T13:47:33Z [app] [l4jdt] ValueError: Failed to load model from file: /workspace/.cache/huggingface/hub/models--XXXXX--Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF/snapshots/56e5dfed05fdd997713eec5e12ea5fbb28dd337d/./open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf
2024-09-03T13:47:33Z [app] [l4jdt] Running with Namespace(model_id='XXXXX/Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF', gguf_filename='open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf', llm_backend=<LLMBackend.llamacpp: 'llamacpp'>, no_gpu=False, context_length=0, gradio_port=8080, few_shots=PosixPath('data/few-shots-sparsify-relationwise.jsonl'), n_shots=2, random_seed=42)
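For reference, the loading call in json_generator.py is essentially the following (other constructor arguments omitted):

```python
from llama_cpp import Llama

# Repo and filename exactly as in the log above; other parameters omitted.
llm = Llama.from_pretrained(
    repo_id="XXXXX/Open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter-GGUF",
    filename="open_gpt4_8x7b_v0.2.Q4_K_M.Lora-Adapter.gguf",
)
# The failure happens inside llama.cpp's model loader (create_tensor_as_view),
# before any inference, so the Python-side arguments don't seem to be the cause.
```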
I haven't found this error mentioned anywhere, so I'm not quite sure what is happening here. If anyone has an idea, that would be great.
Thx in advance