MPS support / quantization

#39
by tonimelisma - opened

I'm trying to run this with the transformers library on an M1 MacBook Pro.

With bfloat16, I get:
"TypeError: BFloat16 is not supported on MPS"

With float16, I get:
"NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS."

Is there a quantized model somewhere that I should be using instead? Any chance of running this model on the Apple GPU with the Hugging Face libraries?


Curious, did you ever get this working?

Hi @tonimelisma
For running quantized Llama on Apple devices, I'd advise using MLX: https://huggingface.co/collections/mlx-community/llama-3-662156b069a5d33b3328603c cc @awni @prince-canuma

Yup, should be easy to do and reasonably fast with MLX:

  1. `pip install mlx-lm`
  2. `mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit --prompt "hello"`

More docs here
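
If you prefer calling MLX from Python rather than the CLI, here is a minimal sketch using the `mlx_lm` `load`/`generate` helpers. The repo id matches the command above; treat the exact keyword arguments as an approximation of the current mlx-lm API rather than a definitive recipe.

```python
# Minimal sketch: the same 4-bit Llama 3 generation via the mlx_lm Python API.
# Assumes `pip install mlx-lm` and the load/generate helpers from recent mlx-lm releases.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
print(response)
```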

Yes, MLX and llama.cpp work fine. I was asking whether the Hugging Face libraries would work, too.

For MPS you need to use torch.float32.
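
A minimal sketch of that workaround with transformers, assuming the meta-llama/Meta-Llama-3-8B-Instruct checkpoint (not named explicitly in this thread); the fallback variable is the one suggested by the error message above and has to be set before torch is imported:

```python
import os
# CPU fallback for ops not yet implemented on MPS (e.g. aten::isin); set before importing torch.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # float32 sidesteps the bfloat16/float16 MPS errors above
).to("mps")

inputs = tokenizer("hello", return_tensors="pt").to("mps")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```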

A lot of things need to be changed elsewhere, but this solves this particular issue. It's probably safe to assume that you need llama.cpp to run on a Mac.
