How to run encoder-decoder models with llama.cpp
Ciao Felladrin,
is this model working for you with llama-cpp-python?
I was never able to make it work...
Hi, Fabio!
I haven't tried it in Python. Maybe the library hasn't been updated yet? Did the non-sharded version not work for you either?
On the other hand, I've successfully been using this model with wllama, so I can confirm it's not an issue with the GGUF files.
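If it helps, one thing worth checking is whether your installed llama-cpp-python already bundles a llama.cpp build with encoder-decoder (T5) support, since that landed fairly recently. Just a rough idea; the hasattr test only tells you whether the low-level binding is exposed at all:

>>> import llama_cpp
>>> llama_cpp.__version__
>>> hasattr(llama_cpp, 'llama_encode')  # if this is False, the bindings likely predate encoder-decoder support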
Ciao Victor, I am a big fan of yours, and your models are all over my Medium articles...
The LaMini series has been a beloved companion of mine for more than 16 months, and being able to use these models with llama.cpp would be a blessing.
But...
here are my results with the non-sharded version:
>>> from llama_cpp import Llama
>>> model = Llama(model_path='model/LaMini-Flan-T5-248M.Q8_0.gguf')
Up to here, no problems; the model loads correctly.
>>> print(model('what is science?', stop=['</s>']))
I always get this error, regardless of the inference method:
D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\src\llama.cpp:13276: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
Any ideas? From the assert message it looks like llama_encode() is never being called before decoding, so maybe the Python wrapper doesn't handle the encoder pass for T5-style models yet.
I tried using the llama.cpp repo directly, without the Python bindings.
It only works like this:
.\llama-cli.exe -m .\models\LaMini-Flan-T5-248M.Q8_0.gguf -p 'what is science?'
Science is the branch of study that deals with the study of the natural world, including the study of the physical, chemical, and biological processes that occur within it, and the development of theories and theories that explain the phenomena and behavior of the natural world. [end of text]
llama_print_timings: load time = 243.34 ms
llama_print_timings: sample time = 1.65 ms / 52 runs ( 0.03 ms per token, 31534.26 tokens per second)
llama_print_timings: prompt eval time = 118.96 ms / 6 tokens ( 19.83 ms per token, 50.44 tokens per second)
llama_print_timings: eval time = 3498.31 ms / 51 runs ( 68.59 ms per token, 14.58 tokens per second)
llama_print_timings: total time = 3700.79 ms / 57 tokens
Log end
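Since the CLI works, my stopgap for now is to call llama-cli from Python with subprocess and capture its output. Just a rough sketch (the paths are the ones on my machine, and the log/timing lines may need stripping depending on where llama-cli sends them):

import subprocess

def lamini_generate(prompt):
    # call the llama-cli binary directly; -m points to the GGUF, -p is the prompt
    result = subprocess.run(
        [r'.\llama-cli.exe',
         '-m', r'.\models\LaMini-Flan-T5-248M.Q8_0.gguf',
         '-p', prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    # the generated text is printed to stdout; logs usually go to stderr
    return result.stdout.strip()

print(lamini_generate('what is science?'))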
When I try to run the server, I still get an error. It looks like there is no way to use it from Python...
llama_new_context_with_model: KV self size = 18.00 MiB, K (f16): 9.00 MiB, V (f16): 9.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.25 MiB
llama_new_context_with_model: CPU compute buffer size = 45.50 MiB
llama_new_context_with_model: graph nodes = 425
llama_new_context_with_model: graph splits = 1
examples/server/server.cpp:696: GGML_ASSERT(llama_add_eos_token(model) != 1) failed
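The only workaround I can think of trying is to override the metadata key that assert checks, something like the line below, but I have no idea whether that is actually safe for T5 tokenization, so take it as a guess:

.\llama-server.exe -m .\models\LaMini-Flan-T5-248M.Q8_0.gguf --override-kv tokenizer.ggml.add_eos_token=bool:false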
How did you do it?
Thank you for your appreciation, Fabio!
About the server issue, I noticed you already reported it on the llama.cpp repo and received helpful replies. I hope they can solve it!