Inference speed is extremely slow with FastChat
#22
by oximi123 · opened
I use FastChat to deploy CodeLlama-7b-Instruct-hf on an A800-80GB server. The inference speed is extremely slow (it runs for more than ten minutes without producing a response to a single request). Any suggestions on how to solve this problem?
Here is how I deploy it with FastChat:
python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf
python -m fastchat.serve.openai_api_server --host localhost --port 8000
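For reference, this is roughly how I query the OpenAI-compatible server (the model name is an assumption based on the basename of the model path):

# send a chat completion request to the FastChat OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "CodeLlama-7b-Instruct-hf", "messages": [{"role": "user", "content": "Write a function that reverses a string."}]}'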
Did you try with the vLLM endpoint?
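A rough sketch of what that looks like, assuming vllm is installed and keeping the same model path; it replaces only the worker step, the controller and API server commands stay the same:

# launch the vLLM worker instead of the default model_worker
python -m fastchat.serve.vllm_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf

The vLLM worker uses PagedAttention and continuous batching, so it is usually much faster than the default HF Transformers-based model_worker for serving requests.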