Strange error while running model
@TheBloke maybe you know a quick fix for the
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
error while trying to run the goliath-120b.Q2_K.gguf model with llama-cpp-python?
Below is the model loading log:
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 137
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = mostly Q2_K
llm_load_print_meta: model params = 117.75 B
llm_load_print_meta: model size = 46.22 GiB (3.37 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.45 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 2691.22 MB
llm_load_tensors: offloading 130 repeating layers to GPU
llm_load_tensors: offloaded 130/140 layers to GPU
llm_load_tensors: VRAM used: 44638.75 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1096.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
I've tried changing the gpu_layers number and the context length - nothing helps, and it's always the same error with the same numbers.
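For reference, this is roughly how I'm loading it (a minimal sketch; the exact path and values here are placeholders, the real numbers match the log above):

```python
from llama_cpp import Llama

# Rough sketch of the loading call; path and values are placeholders.
llm = Llama(
    model_path="./goliath-120b.Q2_K.gguf",
    n_gpu_layers=130,   # tried various values here
    n_ctx=2048,         # and various context lengths
)
```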
Thanks!
Exact same error attempting this model on RunPod. I genuinely have no clue what's causing it. It works on my main machine...
@TheBloke any ideas?
Looks like it might be fixed with this commit https://github.com/ggerganov/llama.cpp/commit/bbecf3f415797f812893947998bda4f866fa900e
Same problem here: running goliath-120b.Q6_K.gguf with ctransformers on a 2x Xeon machine with 128 GB RAM and an 8 GB NVIDIA GPU.
Seems to me that the same value that was increased in llama.cpp needs to be increased somewhere in the ctransformers library as well.
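For reference, the loading call looks roughly like this (a sketch; the file path and gpu_layers value are placeholders):

```python
from ctransformers import AutoModelForCausalLM

# Rough sketch of the ctransformers loading call; path and values are placeholders.
llm = AutoModelForCausalLM.from_pretrained(
    "./goliath-120b.Q6_K.gguf",
    model_type="llama",
    gpu_layers=10,   # only a few layers fit on an 8 GB GPU
)
```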
Problem solved using llama-cpp-python, without any changes to the llama.cpp source code. Now I have to figure out how to send some layers to the GPU... noob issues :) Thanks!
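For anyone else at the same step: layer offloading in llama-cpp-python is controlled by the n_gpu_layers argument. A minimal sketch, assuming a CUDA-enabled build (path and values are placeholders, not tested on my setup yet):

```python
from llama_cpp import Llama

# Offload some layers to the GPU with n_gpu_layers.
# -1 offloads every layer; a smaller number keeps the rest on the CPU.
llm = Llama(
    model_path="./goliath-120b.Q6_K.gguf",  # placeholder path
    n_gpu_layers=20,                        # tune to fit your VRAM
    n_ctx=2048,
)
```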