
fastest inference

#2 by ehartford - opened

Hi, I would like advice on the fastest way to run inference with this model.
I want to run it on 5 million samples, and at the current rate it looks like it will take several months unless I find a faster approach.

Nexusflow org

Hi @ehartford ,

I have found DeepSpeed Inference to work quite well for this model; it lets you use tensor parallelism to shard the model across multiple GPUs.

Here are some links to get started:
https://deepspeed.readthedocs.io/en/latest/inference-init.html
https://www.deepspeed.ai/tutorials/inference-tutorial/
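For reference, here is a minimal sketch along the lines of the linked tutorial. The model ID is a placeholder (substitute this repo's actual ID), and the prompt is just an example:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the actual model ID for this repo.
model_id = "your-org/your-model"

# WORLD_SIZE is set by the deepspeed launcher (one process per GPU).
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Shard the model across GPUs with tensor parallelism and inject
# DeepSpeed's optimized inference kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

prompt = "Hello, world"
inputs = tokenizer(prompt, return_tensors="pt").to(torch.cuda.current_device())
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

You would launch this with the DeepSpeed launcher, e.g. `deepspeed --num_gpus 2 run_inference.py`, which spawns one process per GPU and sets WORLD_SIZE accordingly.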

Also note that it is a bit faster to have the tokenizer pad to 'longest' rather than 'max_length'.
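The difference is that `padding="longest"` only pads each batch to its longest sequence, while `padding="max_length"` always pads to the model's maximum length, wasting compute on pad tokens for short inputs. A quick sketch (the prompt list is hypothetical):

```python
prompts = ["First sample...", "A somewhat longer second sample..."]

# Pads only to the longest sequence in this batch, not to the model max.
batch = tokenizer(
    prompts,
    padding="longest",   # instead of padding="max_length"
    return_tensors="pt",
)
```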

Hope this helps!
