Why is inference so slow?

#17
by hanswang73

Environment:

- NVIDIA A40, 48 GB GPU memory, 80 GB CPU memory
- CUDA 11.8
- transformers == 4.31.0
- 8-bit quantization
- TextIteratorStreamer for streaming generation

Generation speed is only about 1 token per second.
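For reference, a minimal sketch of the setup described above (the checkpoint name is a placeholder, and loading with `load_in_8bit=True` requires `bitsandbytes` to be installed):

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "MODEL_ID"  # placeholder; substitute the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    load_in_8bit=True,   # 8-bit quantization via bitsandbytes
    device_map="auto",   # place weights on the available GPU
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# generate() blocks, so it runs in a background thread while the
# streamer yields decoded text chunks on the main thread.
thread = Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128),
)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```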
