Discuss benefits of this work

#20
by Starlento - opened

I am not part of the Mistral community, so sorry if these are silly questions.
According to the blog, the value of this work is that you train a 47B model but only pay the inference cost of a 13B model, i.e. "using a 47B model at 13B inference speed (but still needing the VRAM of the full 47B model?)". Is there anything else I am missing?
My other question is about performance. I take this model to sit somewhere between a 13B and a 47B. I just checked the leaderboard, and some 34B models have a higher average score. I know real-world use is not the same as benchmark scores, which is why I would like to hear from people with more hands-on experience with these models. Could you kindly provide some insights?

@Starlento That's because this is a MoE (mixture-of-experts) model.
It is made up of 8 expert networks of roughly 7B parameters each, trained on different data. An easy way to think about it: one expert is trained on science, another on math, another on roleplay. They are probably not trained exactly like that, but it's somewhat similar.

You might point out that 8 × 7 does not equal 47: the reason Mixtral has 47B parameters rather than 56B is that some of the parameters (the attention layers) are shared between the experts.
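
If you want to see where the 47B and 13B figures come from, here is a rough back-of-envelope calculation in Python using the sizes from Mixtral's public config (hidden size 4096, 32 layers, 8 SwiGLU experts per layer, top-2 routing). It ignores the router gates and layer norms, so treat the numbers as approximations:

```python
# Back-of-envelope parameter count for a Mixtral-8x7B-style MoE.
# Sizes taken from the public config; router gates and norms are ignored.
hidden     = 4096
ffn_hidden = 14336
layers     = 32
n_experts  = 8
top_k      = 2        # experts used per token
vocab      = 32000
kv_heads   = 8        # grouped-query attention
head_dim   = 128

# Attention per layer: Q and O are hidden x hidden, K and V are hidden x (kv_heads * head_dim).
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)

# One SwiGLU expert: gate, up and down projections.
expert = 3 * hidden * ffn_hidden

# Token embedding + LM head.
embeddings = 2 * vocab * hidden

total  = layers * (attn + n_experts * expert) + embeddings
active = layers * (attn + top_k     * expert) + embeddings

print(f"total params loaded in VRAM: {total / 1e9:.1f}B")   # ~46.7B -> the "47B"
print(f"active params per token:     {active / 1e9:.1f}B")  # ~12.9B -> the "13B"
```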

The reason it uses so much VRAM while running at roughly 13B speed is its architecture.
All of the experts must be loaded, so it takes about as much VRAM as a dense 47B model.

However, during actual inference each token is routed through only the 2 experts best suited to it, so roughly 2 × 7 = 14B parameters are active, which works out to roughly 13B speed.

Which 2 experts are used can change depending on the input, so all 8 experts have to be loaded up front.
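
To make that concrete, here is a toy top-2 router in Python/NumPy. It is a simplified sketch, not Mixtral's actual implementation (the real model routes every token at every layer), but it shows why all 8 experts need to sit in memory while only 2 of them do any compute for a given token:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_experts, top_k = 16, 8, 2

# All 8 expert weight matrices must be loaded, even though only 2 are used per token.
experts = [rng.standard_normal((hidden, hidden)) for _ in range(n_experts)]
router  = rng.standard_normal((hidden, n_experts))   # tiny gating matrix

def moe_layer(x):
    """x: (hidden,) activation for one token."""
    logits  = x @ router
    chosen  = np.argsort(logits)[-top_k:]             # the 2 experts best suited to this token
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                          # softmax over the chosen experts only
    # Only the chosen experts actually run:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(hidden)
print(moe_layer(token).shape)   # (16,) -- produced by 2 of the 8 experts
```
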
Mixtral is excellent for its size and performs really well at instruction tasks. Its benchmark scores are already decent, and they should improve a lot more once the community fine-tunes it further.
