Inference speed is extremely slow with FastChat
#22
by oximi123 · opened
I use FastChat to deploy CodeLlama-7b-Instruct-hf on an A800-80GB server. The inference speed is extremely slow (it runs for more than ten minutes without producing a response to a single request). Any suggestions on how to solve this problem?
Here is how I deploy it with FastChat:
python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf
python -m fastchat.serve.openai_api_server --host localhost --port 8000
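For reference, this is roughly how I query the OpenAI-compatible server (the model name is an assumption based on the basename of the model path):

# send a chat completion request to the FastChat OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "CodeLlama-7b-Instruct-hf", "messages": [{"role": "user", "content": "Write a function that reverses a string."}]}'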
Did you try with the vLLM endpoint?
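A rough sketch of what that looks like, assuming vllm is installed and keeping the same model path; it replaces only the worker step, the controller and API server commands stay the same:

# launch the vLLM worker instead of the default model_worker
python -m fastchat.serve.vllm_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf

The vLLM worker uses PagedAttention and continuous batching, so it is usually much faster than the default HF Transformers-based model_worker for serving requests.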