Why is it so slow, even on GPU?

#3
by luigisaetta - opened

I'm comparing Cerbero 7B to Llama2 7B.
Even though I'm using a VM with 2 A10 GPUs, Cerbero is surprisingly slow. I have checked with gpustat that it is running on the GPU... but getting an answer takes minutes.
Is it a fine-tune of Mistral 7B, or are there architectural changes requiring much more computing power? (I know it was trained on 8 A100 GPUs... a big shape.)

That's weird, it should have almost the same performance as llama2-7b. Can you share some details on your setup?

Test with max_new_tokens=128: Wall time: 1min 54s

I'm using a VM with 2 A10 GPUs (24 + 24 GB GPU memory)... in the same setup, Llama2 (running locally) usually answers in 5 seconds.
The chain is built using LlamaIndex.
With gpustat I have checked that everything runs on the GPU.
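
For reference, a standalone timing test along these lines (outside the LlamaIndex chain, which may set things up differently) could look like the minimal sketch below. The prompt is a placeholder, and device_map="auto" (which needs accelerate installed) is only an assumption about how the model gets spread across the two A10s:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")

# Default load: without revision/torch_dtype this pulls the float32 weights,
# which have to be sharded across the two 24 GB A10s.
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", device_map="auto")

prompt = "Qual è la capitale d'Italia?"  # placeholder question
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(f"Wall time: {time.perf_counter() - start:.1f} s")
print(tokenizer.decode(output[0], skip_special_tokens=True))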

I can give you more context on what I'm doing: I'm evaluating several LLMs with a RAG approach, on the Italian language.
I have tested:

  • Cohere
  • Llama 2 7B and 13B
  • Mistral 7B
  • now testing Cerbero

To build the RAG chain I'm using LlamaIndex. To evaluate, I'm using TruLens-Eval.

The biggest problem I see is always code-switching (question in Italian, answer in English, in let's say 20% of the cases).

Cerbero seems a little better... but it is really slow (10X the latency)

It also really puzzles me that this model, compared with Llama2 7B and Mistral 7B, takes so much GPU memory.
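
For context, a rough back-of-the-envelope estimate (assuming roughly 7.2 billion parameters, as for a Mistral-7B-class model, and ignoring activations and the KV cache) shows why a default float32 checkpoint is so heavy:

params = 7.2e9  # approximate parameter count of a Mistral-7B-class model (assumption)
print(f"float32: {params * 4 / 1e9:.0f} GB")  # ~29 GB, more than a single 24 GB A10
print(f"float16: {params * 2 / 1e9:.0f} GB")  # ~14 GB, fits on one A10

A float32 load therefore has to be split across the two A10s (or partly offloaded to CPU), which adds transfer overhead on top of the raw compute.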

Try to load the float16 variant; the default one is float32:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")

Check model.dtype; it should be torch.float16.
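
A minimal check, assuming a recent transformers version (and noting that, depending on the version, torch_dtype=torch.float16 may also need to be passed explicitly, otherwise the weights can be upcast to float32 at load time):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b", revision="float16", torch_dtype=torch.float16
)
print(model.dtype)                                     # expect torch.float16
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 14 GB, vs. roughly 29 GB in float32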

That is exactly what I have already done. No luck... it still takes minutes.

I suspect that Cerbero struggles with long inputs. I'm using a RAG approach that takes context from some English books, so the prompt to the model is always several hundred tokens (even thousands) long. Have you tested it with long inputs?
Well, on the 8-GPU box described in the docs it will be fast.
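
One quick way to test this hypothesis is to measure how many tokens the RAG prompt actually contains, since generation latency grows with prompt length; rag_prompt below is a hypothetical stand-in for the full prompt LlamaIndex assembles (context chunks plus question):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")

rag_prompt = "..."  # hypothetical placeholder for the assembled context + question
n_tokens = len(tokenizer(rag_prompt)["input_ids"])
print(f"Prompt length: {n_tokens} tokens")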

Next week I will perform some inference benchmarks on smaller GPUs (I also have some A30s and a T4) and we will get to the bottom of this. In the meantime, maybe you can try the llama.cpp version: llama.cpp has a lot of optimizations and it is faster than transformers (and despite the name, you can use the GPU with llama.cpp).
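
For the llama.cpp route, a minimal sketch with the llama-cpp-python bindings might look like this; the GGUF file name is hypothetical (use whichever quantized file is actually published), and n_gpu_layers=-1 offloads all layers to the GPU, which requires a CUDA-enabled build:

from llama_cpp import Llama

llm = Llama(
    model_path="cerbero-7b.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=-1,                      # offload every layer to the GPU
    n_ctx=4096,                           # leave room for long RAG prompts
)

out = llm("Qual è la capitale d'Italia?", max_tokens=128)
print(out["choices"][0]["text"])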

galatolo changed discussion status to closed
