Runtime is about 2x slower than with Meta's own audiocraft code
Meta's local Gradio demo here, which uses the non-HF weights, runs about 2x faster than inference with the HF weights. Worse, bitsandbytes quantization results in 3-4x slower inference when it should be faster. It looks like the Transformers implementation still needs some work.
Hi @lemonflourorange - Sorry to hear that. Can you please share the inference code you are using?
Hey @lemonflourorange! Thanks for opening this issue. Note that the Meta implementation uses fp16 by default, whereas transformers uses fp32. You can put the model in fp16 precision by calling:
```python
model.half()
```
This should give you a nice speed-up versus full fp32 precision.
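For reference, here's a minimal end-to-end sketch of fp16 inference with the Transformers MusicGen integration (the facebook/musicgen-small checkpoint, prompt, and token budget are just placeholder choices for your own setup):

```python
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load the processor and model, then cast the model to fp16 on the GPU.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model = model.half().to("cuda")

# Tokenise a text prompt and generate audio tokens in half precision.
inputs = processor(text=["lo-fi hip hop beat with mellow piano"], return_tensors="pt").to("cuda")
audio_values = model.generate(**inputs, max_new_tokens=256)
```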
Note that bitsandbytes quantisation is expected to be slower than fp16; 3-4x slower is about what we'd expect for dynamic 8-bit quantisation (the gap will be smaller for dynamic 4-bit quantisation). See the results for Whisper ASR, which benchmarks models of a similar size on the speech recognition task: https://github.com/huggingface/peft/discussions/477
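If you want to quantify the difference on your own hardware, a rough timing sketch along these lines (checkpoint, prompt, and token budget are arbitrary choices) compares fp16 against 8-bit bitsandbytes:

```python
import time
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

checkpoint = "facebook/musicgen-small"
processor = AutoProcessor.from_pretrained(checkpoint)
inputs = processor(text=["lo-fi hip hop beat with mellow piano"], return_tensors="pt").to("cuda")

def time_generation(model, n_tokens=256):
    # Synchronise around the timed region so all CUDA kernels are accounted for.
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_tokens)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# fp16 baseline
model_fp16 = MusicgenForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")
print(f"fp16: {time_generation(model_fp16):.1f} s")

# 8-bit bitsandbytes (expected to be slower for MusicGen's small matmuls)
del model_fp16
torch.cuda.empty_cache()
model_int8 = MusicgenForConditionalGeneration.from_pretrained(checkpoint, load_in_8bit=True, device_map="auto")
print(f"int8: {time_generation(model_int8):.1f} s")
```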
Thanks. Half precision fixes this for me. Still not sure why quantization ends up being slower than fp16. I guess quantization only improves inference speed in LLMs?
Indeed - MusicGen has (relatively) small matmuls and large inputs, which makes the 8-bit bnb algorithm quite slow here. You can try the latest 4-bit algorithm, which should be faster: https://huggingface.co/blog/4bit-transformers-bitsandbytes
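As a rough sketch (the NF4 quant type and fp16 compute dtype are just reasonable defaults, not a tuned recipe), loading MusicGen with 4-bit bitsandbytes via a `BitsAndBytesConfig` looks like this:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MusicgenForConditionalGeneration

# 4-bit NF4 quantisation with fp16 compute, as described in the 4-bit blog post.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

checkpoint = "facebook/musicgen-small"
processor = AutoProcessor.from_pretrained(checkpoint)
model = MusicgenForConditionalGeneration.from_pretrained(
    checkpoint,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = processor(text=["lo-fi hip hop beat with mellow piano"], return_tensors="pt").to(model.device)
audio_values = model.generate(**inputs, max_new_tokens=256)
```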