Why this model is extremely slow on M1 Max with 32GB RAM
The weights themselves are no bigger than 32GB.
The model itself needs over 25GB of RAM just to load, which leaves little memory for inference and the system. As a result, swapping may kick in. Honestly, I wouldn't recommend running this model on a 32GB machine; it's better to run it on a machine with 64GB or more.
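For a rough sanity check, here's a back-of-the-envelope sketch (the 13B parameter count is a hypothetical example for illustration, not a measured figure for this model):

```python
# Rough weight-memory estimate, assuming fp16/bf16 weights (2 bytes per
# parameter). The parameter count below is a hypothetical example.
n_params = 13e9        # hypothetical parameter count
bytes_per_param = 2    # fp16 / bf16
print(f"~{n_params * bytes_per_param / 1e9:.0f} GB of weights")  # ~26 GB
```

And that is before counting the KV cache, activations, and whatever the OS needs.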
In that case, what is the best model a 32GB machine could run? Something like a LLaMA 13B variant?
You could try some 30B models with 4-bit quantization, which need around 20GB, leaving some memory for inference and the system.
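The same arithmetic as above, with an assumed overhead figure (quantization scales, KV cache, runtime buffers; real usage varies by backend and context length):

```python
# Rough memory budget for a 4-bit quantized 30B model on a 32 GB machine.
# The overhead value is an assumption, not a measured number.
n_params = 30e9
weight_gb = n_params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> ~15 GB
overhead_gb = 5.0                  # assumed: scales + KV cache + buffers
print(f"weights ~{weight_gb:.0f} GB, total ~{weight_gb + overhead_gb:.0f} GB")
```

That lands at roughly 20GB, which is where the figure above comes from.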
Sorry for the late reply. You can use the memory calculator to check a model's memory usage and see which one would fit your machine: https://huggingface.co/spaces/hf-accelerate/model-memory-usage
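If you'd rather check from a script than the Space, one option is to sum the weight-file sizes reported by the Hub (a sketch using huggingface_hub; the repo id below is just an example):

```python
# Sum the on-disk size of a repo's weight files as a rough proxy for
# load memory. Sketch using huggingface_hub; the repo id is an example.
from huggingface_hub import HfApi

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example repo
info = HfApi().model_info(repo_id, files_metadata=True)
total_bytes = sum(
    s.size or 0
    for s in info.siblings
    if s.rfilename.endswith((".safetensors", ".bin", ".gguf"))
)
print(f"{repo_id}: ~{total_bytes / 1e9:.1f} GB of weight files")
```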
@mzbac
Thank you. This thread has drifted a little off-topic for this repo; I don't blame you for that.
I tried Mixtral 8x7B with Q3_K quantization. It's doable, though slow (3-4 tokens/second). The model itself is 24GB.
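For anyone who wants to reproduce that speed number, here is a minimal sketch with llama-cpp-python (the GGUF path is a placeholder; point it at your own Q3_K file):

```python
# Minimal tokens/second check with llama-cpp-python on Apple Silicon.
# The model path is a placeholder for your local Q3_K GGUF file.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q3_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to Metal
    n_ctx=2048,
)

start = time.perf_counter()
out = llm("Explain what a mixture-of-experts model is.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens / elapsed:.1f} tokens/second")
```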
Maybe a 2x7B model would be a better choice, but there is no such model optimized for Chinese.
In the end, I find that the modest speedup MLX promises doesn't justify the hassle it brings. This community still has a long way to go.
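For completeness, the MLX route looks roughly like this, so anyone can benchmark it themselves (a sketch using the mlx-lm package; the repo id is an example community conversion, and the generate() signature has shifted across mlx-lm releases, so check your version):

```python
# Rough sketch of text generation with mlx-lm on Apple Silicon.
# The repo id is an example 4-bit community conversion, not an endorsement;
# the generate() signature has changed across mlx-lm versions.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit")
text = generate(model, tokenizer, prompt="Explain mixture-of-experts.", max_tokens=128)
print(text)
```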