Strange error while running model
@TheBloke maybe you know a quick fix for the
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
error while trying to run the goliath-120b.Q2_K.gguf model with llama-cpp-python?
Below is the model loading log:
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 137
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = mostly Q2_K
llm_load_print_meta: model params = 117.75 B
llm_load_print_meta: model size = 46.22 GiB (3.37 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.45 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 2691.22 MB
llm_load_tensors: offloading 130 repeating layers to GPU
llm_load_tensors: offloaded 130/140 layers to GPU
llm_load_tensors: VRAM used: 44638.75 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1096.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
I've tried changing the gpu_layers number and the context length - nothing helps, and it's always the same error with the same numbers.
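For reference, this is roughly how I'm loading it (a minimal sketch; the exact path and values here are placeholders, the real numbers match the log above):

```python
from llama_cpp import Llama

# Rough sketch of the loading call; path and values are placeholders.
llm = Llama(
    model_path="./goliath-120b.Q2_K.gguf",
    n_gpu_layers=130,   # tried various values here
    n_ctx=2048,         # and various context lengths
)
```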
Thanks!
Exact same error attempting this model on RunPod. I genuinely have no clue what's causing it. It works on my main machine...
@TheBloke any ideas?
Looks like it might be fixed with this commit https://github.com/ggerganov/llama.cpp/commit/bbecf3f415797f812893947998bda4f866fa900e
Same problem here: running goliath-120b.Q6_K.gguf with ctransformers on a 2x Xeon machine with 128 GB RAM and an 8 GB NVIDIA GPU.
Seems to me that the same value that was increased in llama.cpp needs to be increased somewhere in the ctransformers library as well.
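For reference, the loading call looks roughly like this (a sketch; the file path and gpu_layers value are placeholders):

```python
from ctransformers import AutoModelForCausalLM

# Rough sketch of the ctransformers loading call; path and values are placeholders.
llm = AutoModelForCausalLM.from_pretrained(
    "./goliath-120b.Q6_K.gguf",
    model_type="llama",
    gpu_layers=10,   # only a few layers fit on an 8 GB GPU
)
```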
Problem solved using llama-cpp-python, without any changes to the llama.cpp source code. Now I have to figure out how to send some layers to the GPU... noob issues :) Thanks!
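For anyone else at the same step: layer offloading in llama-cpp-python is controlled by the n_gpu_layers argument. A minimal sketch, assuming a CUDA-enabled build (path and values are placeholders, not tested on my setup yet):

```python
from llama_cpp import Llama

# Offload some layers to the GPU with n_gpu_layers.
# -1 offloads every layer; a smaller number keeps the rest on the CPU.
llm = Llama(
    model_path="./goliath-120b.Q6_K.gguf",  # placeholder path
    n_gpu_layers=20,                        # tune to fit your VRAM
    n_ctx=2048,
)
```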