Why this model is extremely slow on M1 Max with 32GB RAM
The weights themselves are no bigger than 32GB.
The model itself needs over 25GB of RAM just to load, which leaves little memory for inference and the system. As a result, swapping may kick in. Honestly, I wouldn't recommend running this model on a 32GB machine; it's better to run it on a machine with 64GB or more.
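For a rough sanity check, here's a back-of-the-envelope sketch (the 13B parameter count is a hypothetical example for illustration, not a measured figure for this model):

```python
# Rough weight-memory estimate, assuming fp16/bf16 weights (2 bytes per
# parameter). The parameter count below is a hypothetical example.
n_params = 13e9        # hypothetical parameter count
bytes_per_param = 2    # fp16 / bf16
print(f"~{n_params * bytes_per_param / 1e9:.0f} GB of weights")  # ~26 GB
```

And that is before counting the KV cache, activations, and whatever the OS needs.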
In that case, what is the best model a 32GB machine could run? Something like a LLaMA 13B variant?
You could try some 30B models with 4-bit quantization, which need around 20GB, leaving some memory for inference and the system.
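The same arithmetic as above, with an assumed overhead figure (quantization scales, KV cache, runtime buffers; real usage varies by backend and context length):

```python
# Rough memory budget for a 4-bit quantized 30B model on a 32 GB machine.
# The overhead value is an assumption, not a measured number.
n_params = 30e9
weight_gb = n_params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> ~15 GB
overhead_gb = 5.0                  # assumed: scales + KV cache + buffers
print(f"weights ~{weight_gb:.0f} GB, total ~{weight_gb + overhead_gb:.0f} GB")
```

That lands at roughly 20GB, which is where the figure above comes from.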
Sorry for the late reply. You can use the memory calculator to check a model's memory usage and see which one would fit your machine: https://huggingface.co/spaces/hf-accelerate/model-memory-usage
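If you'd rather check from a script than the Space, one option is to sum the weight-file sizes reported by the Hub (a sketch using huggingface_hub; the repo id below is just an example):

```python
# Sum the on-disk size of a repo's weight files as a rough proxy for
# load memory. Sketch using huggingface_hub; the repo id is an example.
from huggingface_hub import HfApi

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example repo
info = HfApi().model_info(repo_id, files_metadata=True)
total_bytes = sum(
    s.size or 0
    for s in info.siblings
    if s.rfilename.endswith((".safetensors", ".bin", ".gguf"))
)
print(f"{repo_id}: ~{total_bytes / 1e9:.1f} GB of weight files")
```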
@mzbac
Thank you. This thread has drifted a little off-topic for this repo; I don't blame you for that.
I tried Mixtral 8x7B with Q3_K quantization. It's doable, though slow (3-4 tokens/second). The model itself is 24GB.
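For anyone who wants to reproduce that speed number, here is a minimal sketch with llama-cpp-python (the GGUF path is a placeholder; point it at your own Q3_K file):

```python
# Minimal tokens/second check with llama-cpp-python on Apple Silicon.
# The model path is a placeholder for your local Q3_K GGUF file.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q3_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to Metal
    n_ctx=2048,
)

start = time.perf_counter()
out = llm("Explain what a mixture-of-experts model is.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens / elapsed:.1f} tokens/second")
```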
Maybe a 2x7B model would be a better choice, but there is no such model optimized for Chinese.
In the end, I find that the modest speedup MLX promises doesn't justify the hassle it brings. This community still has a long way to go.
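For completeness, the MLX route looks roughly like this, so anyone can benchmark it themselves (a sketch using the mlx-lm package; the repo id is an example community conversion, and the generate() signature has shifted across mlx-lm releases, so check your version):

```python
# Rough sketch of text generation with mlx-lm on Apple Silicon.
# The repo id is an example 4-bit community conversion, not an endorsement;
# the generate() signature has changed across mlx-lm versions.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit")
text = generate(model, tokenizer, prompt="Explain mixture-of-experts.", max_tokens=128)
print(text)
```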