Quantized Mistral-NeMo-Instruct-2407 versions for Prompt Sensitivity Blog

This repository contains four quantized versions of Mistral-NeMo-Instruct-2407, created using llama.cpp. The goal was to examine how different quantization methods affect prompt sensitivity on sentiment classification tasks.

Quantization Details

Models were quantized using llama.cpp (release b3922). The imatrix versions were quantized with an imatrix.dat file generated from Bartowski's calibration dataset.
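
For readers who want to reproduce the process, here is a minimal sketch of the workflow. It assumes the llama.cpp tools from release b3922 (llama-imatrix, llama-quantize) are on your PATH; the input and calibration file names are illustrative placeholders, not the exact commands used for this repository.

```python
# Sketch of the quantization workflow; file names are hypothetical.
import subprocess

FP16_GGUF = "Mistral-NeMo-12B-Instruct-2407-F16.gguf"  # assumed full-precision input
CALIB_TXT = "calibration.txt"                          # assumed calibration text file

# 1. Build the importance matrix from the calibration dataset.
subprocess.run(
    ["llama-imatrix", "-m", FP16_GGUF, "-f", CALIB_TXT, "-o", "imatrix.dat"],
    check=True,
)

# 2. Quantize at 5-bit and 8-bit, with and without the imatrix.
for qtype in ("Q5_0", "Q8_0"):
    # Default quantization.
    subprocess.run(
        ["llama-quantize", FP16_GGUF,
         f"Mistral-NeMo-12B-Instruct-2407-{qtype}.gguf", qtype],
        check=True,
    )
    # imatrix-guided quantization.
    subprocess.run(
        ["llama-quantize", "--imatrix", "imatrix.dat", FP16_GGUF,
         f"Mistral-NeMo-12B-Instruct-2407-imatrix-{qtype}.gguf", qtype],
        check=True,
    )
```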

Models

| Filename | Size | Description |
| --- | --- | --- |
| Mistral-NeMo-12B-Instruct-2407-Q8_0.gguf | 13 GB | 8-bit default quantization |
| Mistral-NeMo-12B-Instruct-2407-Q5_0.gguf | 8.73 GB | 5-bit default quantization |
| Mistral-NeMo-12B-Instruct-2407-imatrix-Q8_0.gguf | 13 GB | 8-bit with imatrix quantization |
| Mistral-NeMo-12B-Instruct-2407-imatrix-Q5_0.gguf | 8.73 GB | 5-bit with imatrix quantization |

I've also included the imatrix.dat file (7.05 MB) used to create the imatrix-quantized versions.
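
Any of these files can be loaded with the llama-cpp-python bindings (pip install llama-cpp-python). The sketch below is illustrative; the file path and parameters are placeholders to adjust for your setup.

```python
# Minimal loading/inference example; parameters are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="Mistral-NeMo-12B-Instruct-2407-imatrix-Q5_0.gguf",
    n_ctx=4096,        # context window; adjust to your memory budget
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this review as Positive or Negative: "
                   "'I loved this film.'",
    }],
    max_tokens=8,
    temperature=0.0,   # deterministic output for classification
)
print(out["choices"][0]["message"]["content"])
```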

Findings

Prompt sensitivity was observed only in the 5-bit imatrix-quantized model; the 5-bit model produced with llama.cpp's default quantization did not show it. Neither 8-bit model exhibited prompt sensitivity with either quantization method. A sketch of the kind of check involved appears below.
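
The sketch below shows the general shape of such a prompt-sensitivity check: run the same sentiment classification with several semantically equivalent prompt templates and see whether the predicted label flips. It is illustrative only; the templates, review text, and parameters are assumptions, and the blog post describes the actual methodology.

```python
# Illustrative prompt-sensitivity probe (not the exact evaluation harness).
from llama_cpp import Llama

# Semantically equivalent prompt templates; a prompt-sensitive model
# may flip its label between them.
PROMPTS = [
    "Classify the sentiment of this review as Positive or Negative: {text}",
    "Is the following review Positive or Negative? {text}",
    "Review: {text}\nSentiment (Positive/Negative):",
]

def classify(llm: Llama, template: str, text: str) -> str:
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": template.format(text=text)}],
        max_tokens=4,
        temperature=0.0,
    )
    return out["choices"][0]["message"]["content"].strip()

llm = Llama(
    model_path="Mistral-NeMo-12B-Instruct-2407-imatrix-Q5_0.gguf",
    n_ctx=2048,
)
review = "The plot dragged, but the performances were outstanding."

for template in PROMPTS:
    print(classify(llm, template, review))
```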

For further discussion, please see my accompanying blog post.

Author

Simon Barnes
