
This repository contains alternative quantized models of Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

I am careful to say "alternative" rather than "better" or "improved" because I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization, but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below compares the perplexities of these quantized models with those of the current llama.cpp quantization approach on Wikitext at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16))/PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q2_K | mixtral-instruct-8x7b-q2k.gguf | 6.8953 | 56.4% | 5.2679 | 19.5% |
| Q3_K_S | mixtral-instruct-8x7b-q3k-small.gguf | 4.7038 | 6.68% | 4.6401 | 5.24% |
| Q3_K_M | mixtral-instruct-8x7b-q3k-medium.gguf | 4.6663 | 5.83% | 4.5608 | 3.44% |
| Q4_K_S | mixtral-instruct-8x7b-q4k-small.gguf | 4.5105 | 2.30% | 4.4630 | 1.22% |
| Q4_K_M | mixtral-instruct-8x7b-q4k-medium.gguf | 4.5105 | 2.30% | 4.4568 | 1.08% |
| Q5_K_S | mixtral-instruct-8x7b-q5k-small.gguf | 4.4402 | 0.71% | 4.4277 | 0.42% |
| Q4_0 | mixtral-instruct-8x7b-q40.gguf | 4.5102 | 2.29% | 4.4908 | 1.85% |
| Q4_1 | mixtral-instruct-8x7b-q41.gguf | 4.5415 | 3.00% | 4.4612 | 1.18% |
| Q5_0 | mixtral-instruct-8x7b-q50.gguf | 4.4361 | 0.61% | 4.4297 | 0.47% |
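
The quantization error defined above is simple to reproduce. The following is a minimal Python sketch, not the script used to produce the table. The function names `quantization_error` and `implied_fp16_ppl` are hypothetical. The fp16 perplexity is not stated in this card, so the example back-computes an approximate baseline from one table row (Q5_0, llama.cpp), and the rounding of the percentages limits its accuracy.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase of a quantized model over fp16."""
    if ppl_fp16 <= 0:
        raise ValueError("ppl_fp16 must be positive")
    return (ppl_quantized - ppl_fp16) / ppl_fp16


def implied_fp16_ppl(ppl_quantized: float, error: float) -> float:
    """Invert the definition to recover the fp16 perplexity from a table row."""
    return ppl_quantized / (1.0 + error)


# Assumption: the fp16 baseline is back-computed from the Q5_0 llama.cpp row
# (PPL 4.4361, error 0.61%). It is approximate because the table is rounded.
ppl_fp16 = implied_fp16_ppl(4.4361, 0.0061)

# Recompute the error of the new Q4_K_M quant (PPL 4.4568) against that baseline.
err = quantization_error(4.4568, ppl_fp16)
print(f"approx. fp16 PPL: {ppl_fp16:.4f}, Q4_K_M (new quants) error: {err:.2%}")
```

If you have measured the fp16 perplexity yourself, pass it directly to `quantization_error` instead of using the back-computed value.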
GGUF metadata: model size 46.7B params, architecture llama.