Why is it so slow, even on GPU
I'm comparing Cerbero 7B to Llama2 7B.
Even though I'm using a VM with 2 A10 GPUs, Cerbero is surprisingly slow. I have checked with gpustat that it is running on the GPU... but getting an answer takes minutes.
Is it a fine-tuning of Mistral 7B, or are there architectural changes that require much more computing power? (I know it was trained on 8 A100 GPUs... a big shape.)
That's weird, it should have almost the same performance as llama2-7b. Can you share some details on your setup?
Test with max_new_tokens=128: wall time 1 min 54 s.
I'm using a VM with 2 A10 GPUs (24 + 24 GB of GPU memory)... in the same setting Llama2 (running locally) usually answers in about 5 seconds.
The chain is built using LlamaIndex.
With gpustat I have checked that everything runs on the GPU.
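For reference, this is roughly the kind of generation call I'm timing (a simplified sketch outside the LlamaIndex chain; the prompt here is just a placeholder):

```python
# Simplified timing sketch (the real run goes through the LlamaIndex chain;
# the prompt below is only a placeholder).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "galatolo/cerbero-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("An example question...", return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
print(f"Wall time: {time.time() - start:.1f}s")
```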
I can give you more context on what I'm doing: I'm evaluating several LLMs with a RAG approach on the Italian language.
I have tested:
- Cohere
- Llama 2 7B and 13B
- Mistral 7B
- now testing Cerbero
To build the RAG chain I'm using LlamaIndex; to evaluate, TruLens-Eval (a rough sketch of the setup is at the end of this post).
The biggest problem I see is always code-switching (question in Italian, answer in English, in let's say 20% of cases).
Cerbero seems a little better... but it is really slow (10x the latency).
What also puzzles me is that this model, compared with Llama2 7B and Mistral 7B, takes so much more GPU memory.
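For context, the chain is set up roughly like this (a sketch using the llama_index 0.9-style API; paths, parameters and the embedding choice are illustrative, not my exact configuration):

```python
# Rough sketch of the RAG chain (llama_index 0.9-style API).
# Paths, parameters and the embedding model are illustrative.
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="galatolo/cerbero-7b",
    tokenizer_name="galatolo/cerbero-7b",
    context_window=4096,
    max_new_tokens=128,
    device_map="auto",
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

documents = SimpleDirectoryReader("./books").load_data()  # English source books
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=3)

print(query_engine.query("A question in Italian about the books..."))
```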
Try to load the float16 variant, the default one is float32:
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")
Then check model.dtype: it should be float16.
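A quick way to verify what actually got loaded (a sketch; passing torch_dtype explicitly guards against transformers up-casting the weights to its default float32):

```python
# Sketch: verify dtype and memory footprint of the float16 revision.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b",
    revision="float16",
    torch_dtype=torch.float16,  # keep the fp16 weights from being up-cast to float32
)
print(model.dtype)                               # expect torch.float16
print(model.get_memory_footprint() / 1e9, "GB")  # roughly 14 GB for a 7B model in fp16
```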
That is exactly what I have already done. No luck... it still takes minutes.
I suspect that Cerbero struggles with long inputs. I'm using a RAG approach that takes context from some English books, so the prompt to the model is always several hundred tokens (sometimes thousands) long. Have you tested it with long inputs?
Well, on the 8-GPU box described in the docs it will be fast.
Next week I will perform some inference benchmarks on smaller GPUs (I also have some A30s and a T4) and we will get to the bottom of this. In the meantime maybe you can try the llama.cpp version. llama.cpp has a lot of optimizations and it is faster than transformers (despite the name, you can use the GPU with llama.cpp).
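For example, with the llama-cpp-python bindings (the GGUF file name below is a placeholder; n_gpu_layers=-1 offloads all layers to the GPU):

```python
# Sketch: running a GGUF conversion of the model with llama-cpp-python.
# The model file name is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./cerbero-7b.Q4_K_M.gguf",  # placeholder path to a GGUF file
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=4096,       # leave room for long RAG prompts
)

out = llm("A question in Italian...", max_tokens=128)
print(out["choices"][0]["text"])
```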