Abstract
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
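As a quick illustration of the routing described in the abstract, here is a minimal sketch of a top-2 sparse MoE feedforward block. It is not the released Mixtral implementation; the class names, module names (`w1`/`w2`/`w3`, `SparseMoEBlock`), dimensions, and the simple per-expert loop are all assumptions chosen for readability.

```python
# Illustrative sketch only: a top-2-of-8 sparse MoE feedforward block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """One SwiGLU feedforward expert (gate, up, and down projections)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SparseMoEBlock(nn.Module):
    """Replaces the dense FFN: a router picks the top-2 of 8 experts per token."""

    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router logits
        self.experts = nn.ModuleList([SwiGLU(dim, hidden_dim) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, dim). Keep only the top-k router logits per token.
        logits = self.gate(x)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # softmax over the top-k logits only

        out = torch.zeros_like(x)
        for rank in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, rank] == e  # tokens whose rank-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, rank].unsqueeze(-1) * expert(x[mask])
        return out


# Example: 6 tokens through one MoE block; only 2 of the 8 experts run per token.
block = SparseMoEBlock(dim=64, hidden_dim=256)
y = block(torch.randn(6, 64))
print(y.shape)  # torch.Size([6, 64])
```

This is also where the 47B total vs. 13B active parameter gap comes from: all 8 experts exist in memory, but each token's forward pass only touches the 2 it is routed to.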
Community
Impressive, very nice.
MoE routing & offloading (https://arxiv.org/pdf/2312.17238.pdf): can expert selection be predicted ahead of time by analyzing the token sequence, and the experts prefetched if they are not in the cache? Similar to speculative decoding with a smaller model?
Thank you for sharing this wonderful work. I just learned about this from an HF member.
There are two questions I want to verify:
- Expert locality and caching: from the experiment, it seems that the experts kept by the top-2 LRU cache are not aligned with the experts likely to be selected again (see the small grey squares).
- Guessing which experts to use and loading them ahead of time: is the guess made with a function cheaper than the gating itself (e.g. applying the next layer's gating function to the previous layer's hidden states)? Are there experiments on guess accuracy, and is there any overhead when the guess is wrong? (A rough sketch of the idea follows these questions.)
"quantizing experts to a lower bitwidth, while keeping all non-expert layers at 4-bit" : I just checked paper, do we have dynamic range for each layer in a typical dataset benchmark test ?
Thank you for this work.
@sandyasm
Introduces Mixtral 8x7B, a sparse mixture of experts (SMoE) model. Notes:
- Same decoder-only architecture as Mistral 7B, but each layer has 8 feedforward blocks (experts); a router selects two experts per token, so it picks which parameters are used and the weights for combining them, and the block output is the weighted sum of the selected experts' outputs.
- The gating network is a linear layer mapping the input to n logits (n = number of experts), followed by Top-K (logits outside the top k are set to negative infinity) and a softmax over the surviving logits (equation in Section 2.1). The MoE block replaces the FFN sub-block of the transformer and uses SwiGLU experts.
- SMoE makes the sparse parameter count (the full parameter count used in training) differ from the active parameter count used at inference, since experts with zero gate weight need no forward pass; a larger model with fewer active parameters gives better throughput.
- Better than LLaMA 2 70B on math and code generation, close on AGIEval and BBH; with only 13B active parameters, Mixtral beats LLaMA 2 13B on all benchmarks (commonsense reasoning, world knowledge, reading comprehension, math, code, and aggregated results - Section 3). Beats GPT-3.5 on HellaSwag and WinoGrande; good long-context performance.
- Mixtral 8x7B - Instruct is trained with supervised fine-tuning on an instruction dataset followed by direct preference optimization (DPO); it beats Gemini Pro and Claude-2.1 and is the only open-source model in the top 10 on human evaluation.
- Includes visualizations of the experts selected at each layer for each token, to check whether the gates learn human-recognizable domain distinctions; no specific pattern is found (except in code and math).
From Mistral AI.
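For reference, the Section 2.1 gating as I read it (please double-check against the paper): the router logits $x \cdot W_g$ are sparsified with Top-K, softmaxed, and used to weight the SwiGLU experts.

$$
y = \sum_{i=0}^{n-1} \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_g)\big)_i \cdot \mathrm{SwiGLU}_i(x)
$$

$$
\mathrm{TopK}(\ell)_i =
\begin{cases}
\ell_i & \text{if } \ell_i \text{ is among the top-}k \text{ coordinates of } \ell \\
-\infty & \text{otherwise}
\end{cases}
$$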
Links: News (HuggingFace Blog), GitHub (vLLM project)
Revolutionizing Language Models: Mixtral's Sparse Mixture of Experts Unveiled