Failed to run in koboldcpp and llama.cpp

#1 by FenixInDarkSolo - opened

Hello, I have downloaded the model and tried to run it on koboldcpp, but it does not work.
I have checked the SHA256 and confirmed the file is complete.

# in llama.cpp
error loading model: unrecognized tensor type 7

# in koboldcpp
Input: {"n": 1, "max_context_length": 2048, "max_length": 256, "rep_pen": 1.15, "temperature": 1, "top_p": 0.1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n\n\n### Instruction:\n\n\u80fd\u8aaa\u4e2d\u6587\u55ce\uff1f\n\n### Response:\n\n", "quiet": true, "stop_sequence": ["\n### Instruction:", "\n### Response:"]}

Processing Prompt [BLAS] (45 / 45 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 819479152, available 805306368)
----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 57955)
Traceback (most recent call last):
  File "socketserver.py", line 316, in _handle_request_noblock
  File "socketserver.py", line 347, in process_request
  File "socketserver.py", line 360, in finish_request
  File "koboldcpp.py", line 196, in __call__
  File "http\server.py", line 651, in __init__
  File "socketserver.py", line 747, in __init__
  File "http\server.py", line 425, in handle
  File "http\server.py", line 413, in handle_one_request
  File "koboldcpp.py", line 297, in do_POST
  File "koboldcpp.py", line 170, in generate
OSError: exception: access violation writing 0x0000000000000000
----------------------------------------
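
For reference, the request above can be reproduced with a small Python script like this (a minimal sketch; it assumes koboldcpp's KoboldAI-compatible /api/v1/generate endpoint on the default port 5001 and trims the sampler settings to the essentials):

import requests

# Minimal payload; koboldcpp accepts the same sampler fields shown in the log above.
payload = {
    "max_context_length": 2048,
    "max_length": 256,
    "rep_pen": 1.15,
    "temperature": 1,
    "top_p": 0.1,
    "prompt": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n\nCan you speak Chinese?\n\n### Response:\n\n"
    ),
    "stop_sequence": ["\n### Instruction:", "\n### Response:"],
}

resp = requests.post("http://127.0.0.1:5001/api/v1/generate", json=payload)
resp.raise_for_status()
# The KoboldAI-style API normally returns {"results": [{"text": ...}]}.
print(resp.json()["results"][0]["text"])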

The llama.cpp quantization methods were updated in May. Please try cloning the latest llama.cpp repo and re-compiling before loading the model.
It works well on my device.
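
In case it helps, the rebuild steps look roughly like this (a sketch driven from Python's subprocess; it assumes git and make are on your PATH, the file names are placeholders, and re-quantizing needs the original f16 GGML file):

import subprocess

# Fetch and build the current llama.cpp (on Windows you may prefer CMake or w64devkit).
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp"], check=True)
subprocess.run(["make"], cwd="llama.cpp", check=True)

# Optional: re-quantize from an f16 GGML file so the result uses the new q5_1 layout.
# Both file names below are placeholders for wherever you keep your copies.
subprocess.run(
    ["./quantize",
     "models/chinese-alpaca-7b-plus-f16.bin",
     "models/chinese-alpaca-7b-plus-q5_1.bin",
     "q5_1"],
    cwd="llama.cpp",
    check=True,
)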

As for koboldcpp, I have not tested it yet; I will test it when I have time.

Hmm... I believe I have checked out the latest version of llama.cpp?

D:\program\llama.cpp>python runner_interactive.py
main -m ./models/chinese-Alpaca-7b-plus-ggml-q5_1.bin -t 12 -n -1 -c 2048 --keep -1 --repeat_last_n 2048 --top_k 160 --top_p 0.95 --color -ins -r "User:" --keep -1 --interactive-first
main: build = 536 (cdd5350)
main: seed  = 1683959650
llama.cpp: loading model from ./models/chinese-Alpaca-7b-plus-ggml-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/chinese-Alpaca-7b-plus-ggml-q5_1.bin'
main: error: unable to load model

D:\program\llama.cpp>git log -n 1 --pretty=format:'%H'
'cdd5350892b1d4e521e930c77341f858fcfcd433'

D:\program\llama.cpp>git merge fb62f924336c9746da9976c6ab3c2e6460258d54
Already up to date.

And I have tested it in the newest koboldcpp (1.21), and it works with a warning.

System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from D:\program\koboldcpp\chinese-Alpaca-7b-plus-ggml-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B

Legacy LLAMA GGJT compatability changes triggered.
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 6749.78 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

---
Warning: Your model has an INVALID or OUTDATED format (ver 3). Please reconvert it for better results!
---
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001

Check your own output:

error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)

The llama.cpp developers just updated their Q4 and Q5 quantization formats on May 11th, and the old q5_1 format is no longer supported.
Maybe try one of these:

  1. Clone the latest repo, re-compile, and re-quantize yourself.
  2. Load the q8_0 format model.
  3. Clone an old repo from before May 11th and re-compile it to load the q5_1 model I provided.
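
If you are not sure which revision a .bin file actually is, you can peek at its first few bytes (a minimal sketch; the magic values are the ones llama.cpp's loaders check, and a "ggjt" file with version 1 is the pre-#1405 layout that new builds reject):

import struct

path = "chinese-Alpaca-7b-plus-ggml-q5_1.bin"  # adjust to your own file

# Magic numbers used by llama.cpp's GGML loaders.
MAGICS = {0x67676D6C: "ggml (unversioned)", 0x67676D66: "ggmf", 0x67676A74: "ggjt"}

with open(path, "rb") as f:
    magic = struct.unpack("<I", f.read(4))[0]
    kind = MAGICS.get(magic, f"unknown ({magic:#x})")
    if magic in (0x67676D66, 0x67676A74):
        # Versioned formats store a uint32 version right after the magic;
        # ggjt v1 is the old quantization, v2+ is what current llama.cpp expects.
        version = struct.unpack("<I", f.read(4))[0]
        print(f"{kind} v{version}")
    else:
        print(kind)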

q5_1 runs successfully in koboldcpp 1.21 (instruct mode), and q8_0 runs successfully in the newest llama.cpp. Thanks a lot. (^_^)b

