MPS support and quantization
I'm trying to run this with the transformers library on an M1 MacBook Pro.
With bfloat16, I get:
"TypeError: BFloat16 is not supported on MPS"
With float16, I get:
"NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1
to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS."
Is there a quantized model somewhere that I should be using instead? Any chance of running this model on the Apple GPU with the Hugging Face libraries?
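For reference, the workaround suggested in the error message would look roughly like this; the checkpoint name is just a placeholder, not a specific model I'm pointing at.

```python
import os

# The error message suggests enabling CPU fallback for ops missing on MPS.
# Set this before importing torch so it is picked up at initialization.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # bfloat16 raises the TypeError above on MPS
).to("mps")

inputs = tokenizer("Hello", return_tensors="pt").to("mps")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With the fallback enabled, the unsupported `aten::isin` op runs on the CPU, so generation works but is slower than a fully native MPS path.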
Curious, did you ever get this working?
Hi
@tonimelisma
For running quantized Llama on Apple devices, I recommend using MLX (see the sketch below): https://huggingface.co/collections/mlx-community/llama-3-662156b069a5d33b3328603c cc
@awni
@prince-canuma
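For example, with the `mlx-lm` package something like this should run a 4-bit Llama 3 on the Apple GPU; the repo name below is just one example from that collection, so swap in whichever variant you need.

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Example 4-bit checkpoint from the mlx-community collection linked above.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Hello, how are you?",
    verbose=True,  # stream tokens and print generation stats
)
```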
Yes, MLX and llama.cpp work fine. I was inquiring whether the Hugging Face libraries would work, too.
For MPS you need to use torch.float32 (see the sketch below).
A lot of things need to be changed elsewhere, but this solves this particular issue. It's probably safe to assume that you need llama.cpp to run on a Mac.
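To illustrate the float32 suggestion, a minimal sketch with transformers, assuming the same setup as the original question (the checkpoint name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in float32, which MPS supports natively.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
).to("mps")
```

Expect this to use roughly twice the memory of a half-precision load, so a quantized MLX or llama.cpp build is still the more practical route on most Apple machines.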