MLX / mixtral · Mixture of Experts

Why this model is extremely slow on M1 Max with 32GB RAM

#3
by bltfqx - opened

The weights are no bigger than 32GB.

bltfqx changed discussion title from Why is model is extremely slow on M1 Max with 32GB RAM to Why this model is extremely slow on M1 Max with 32GB RAM
MLX Community org

The model itself requires over 25GB of RAM to load, leaving little memory for inference and system usage. As a result, swapping may kick in. To be honest, I wouldn't recommend running this model on a 32GB machine; it's better to run it on a machine with 64GB or more.
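
For a rough sense of why 32GB is tight, here is a back-of-the-envelope sketch. The parameter count (~46.7B total for Mixtral 8x7B) and the overhead assumptions are mine, not figures from this repo:

```python
# Back-of-the-envelope weight memory for a quantized model; all figures are approximations.
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB, ignoring quantization scales and KV cache."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Mixtral 8x7B has roughly 46.7B total parameters (assumed figure).
for bits in (4, 8, 16):
    print(f"{bits}-bit weights: ~{weight_memory_gb(46.7, bits):.1f} GB")
# -> 4-bit: ~21.7 GB, 8-bit: ~43.5 GB, 16-bit: ~87.0 GB
```

Real 4-bit checkpoints also store group scales and biases, so the in-memory size ends up a few GB higher than the bare estimate, which lines up with the ~25GB figure above and leaves very little headroom on a 32GB machine once the KV cache and macOS are counted.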

In that case, what is the best model a 32GB machine could run? Something like a LLaMA 13B variant?

MLX Community org

You could try some 30B models with 4-bit quantization, which will require around 20GB. That still leaves some memory for inference and the system.
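
For reference, loading a quantized MLX model only takes a couple of lines with mlx-lm. A minimal sketch; the repo name below is a placeholder, not a specific recommendation:

```python
# Minimal mlx-lm usage sketch; substitute whichever 4-bit MLX model fits your memory budget.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/some-30B-model-4bit")  # placeholder repo id
prompt = "Explain mixture-of-experts models in one paragraph."
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)
```

With `verbose=True` you also get the tokens-per-second numbers, which makes it easy to tell whether the machine is actually swapping.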

@mzbac What about TomGrc/FusionNet_7Bx2_MoE_14B? Yes, I'm a little obsessed with MoE.

MLX Community org

Sorry for the late reply. You can use the memory calculator to check a model's memory usage and see which one would fit your machine: https://huggingface.co/spaces/hf-accelerate/model-memory-usage
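
If you prefer checking from a script rather than the Space, a rough alternative is to sum the weight-file sizes on the Hub. This only counts weights, not KV cache or runtime overhead, and the repo id below is just the example from this thread:

```python
# Rough, local alternative to the memory calculator: sum the weight files on the Hub.
# Only accounts for weights; inference (KV cache, activations) needs extra headroom.
from huggingface_hub import HfApi

repo_id = "TomGrc/FusionNet_7Bx2_MoE_14B"  # example repo mentioned above
info = HfApi().model_info(repo_id, files_metadata=True)
total_bytes = sum(
    f.size or 0
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".bin"))
)
print(f"Weight files: ~{total_bytes / 1024**3:.1f} GB")
```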

@mzbac Thank you. This thread has drifted a little off topic for this repo, so I won't blame you.
I tried Mixtral 8x7B with Q3_K; it's doable, though slow (3~4 tokens/second). The model itself is 24GB.
Maybe a 2x7B would be a better choice; however, there is no such model optimized for Chinese.

In the end, I find that the small speedup MLX promises doesn't justify the hassle it brings. This community still has a long way to go.
