updated readme
- noted tooling used to quantize models
- updated summary
- revised recommendations
- added detail of how PPL was calculated.
README.md CHANGED
---
source repo: [BSC-LT/salamandra-2b-instruct](https://huggingface.co/BSC-LT/salamandra-2b-instruct)

# **Quantization summary**

The base model was quantized in [llama.cpp](https://github.com/ggerganov/llama.cpp) with a substantial importance matrix covering all target languages (some 34x1000 samples, 96 MB of text), drawn from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset. Logs of the process are included.
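
For context, the sketch below shows one way such a multilingual calibration file could be assembled with the Hugging Face `datasets` library; the language list, sample counts, and the `unshuffled_deduplicated_*` configuration names are assumptions for illustration, not a record of the actual commands used.

```python
# Hypothetical sketch: build a multilingual calibration file from OSCAR for an
# importance matrix. The real run used roughly 34 languages x 1000 samples
# (~96 MB of text); only a few languages are listed here.
from datasets import load_dataset

languages = ["ca", "es", "en", "de", "fr"]  # extend to all target languages
samples_per_language = 1000

with open("imatrix-calibration.txt", "w", encoding="utf-8") as out:
    for lang in languages:
        # Streaming avoids downloading each full OSCAR split.
        ds = load_dataset(
            "oscar-corpus/oscar",
            f"unshuffled_deduplicated_{lang}",
            split="train",
            streaming=True,
            trust_remote_code=True,
        )
        for i, row in enumerate(ds):
            if i >= samples_per_language:
                break
            out.write(row["text"].strip() + "\n\n")
```

In current llama.cpp builds, a file like this would be fed to `llama-imatrix` to produce the importance matrix, which `llama-quantize` then consumes via its `--imatrix` option.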

- **IQ3_M**: At <1.8 GB, the smallest model worth highlighting.
- **IQ4_XS** or **Q4_K_S**: It's a toss-up for the sub-2 GB quantizations. Metal users will get more t/s from Q4_K_S.
- **Q5_K_M**: Excellent balance above **Q4**, recommended for most applications.
- **Q6_K**: Provides near-**bf16** performance with size savings.

---

# Quantization

| **Quantization Type** | **PPL(Q)** | **ln(PPL(Q)/PPL(bf16))** | **File Size (GB)** | **Notes** |
|-----------------------|------------|--------------------------|--------------------|---------------------------------------------------------------------|
| [**IQ3_M**](salamandra-2b-instruct_IQ3_M.gguf) | 16.774 | 0.086769 | 1.7 | Good size efficiency with acceptable PPL increase |
| [**Q3_K_L**](salamandra-2b-instruct_Q3_K_L.gguf) | 16.5067 | 0.070705 | 1.8 | Further size reduction with modest PPL increase |
| [**IQ4_XS**](salamandra-2b-instruct_IQ4_XS.gguf) | 15.9591 | 0.036968 | 1.8 | Good size reduction with acceptable PPL increase (**recommended**) |
| [**Q4_K_S**](salamandra-2b-instruct_Q4_K_S.gguf) | 15.9346 | 0.035431 | 1.9 | Good size reduction with minimal PPL impact (**recommended**) |
| [**Q5_K_M**](salamandra-2b-instruct_Q5_K_M.gguf) | 15.4746 | 0.006139 | 2.2 | Excellent balance of PPL and size (**recommended**) |
| [**Q6_K**](salamandra-2b-instruct_Q6_K.gguf) | 15.3961 | 0.001053 | 2.4 | Nearly lossless performance with reduced size |
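
The log-ratio column is simply the natural logarithm of each quantization's perplexity divided by the bf16 reference perplexity. The bf16 value itself is not shown in this excerpt, so the sketch below back-calculates it from the Q6_K row; treat it as illustrative only.

```python
import math

# Reference bf16 perplexity, back-calculated from the Q6_K row:
# 15.3961 / exp(0.001053) ~= 15.3799 (illustrative; not quoted in this excerpt).
PPL_BF16 = 15.3799

# PPL(Q) values from the table above.
quants = {
    "IQ3_M": 16.7740,
    "Q3_K_L": 16.5067,
    "IQ4_XS": 15.9591,
    "Q4_K_S": 15.9346,
    "Q5_K_M": 15.4746,
    "Q6_K": 15.3961,
}

for name, ppl in quants.items():
    log_diff = math.log(ppl / PPL_BF16)  # ln(PPL(Q) / PPL(bf16))
    # The selection criterion described below requires log_diff < 0.3.
    print(f"{name:8s} {log_diff:.6f}")
```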

### **Notes:**

- **Recommended Quantizations:**
  - **IQ4_XS:** A good size reduction with minimal PPL impact. The file size is actually very close to 1.9 GB, so not much different from Q4_K_S.
  - **Q4_K_S:** A good size reduction with minimal PPL impact.
  - **Q5_K_M:** Offers the best balance between low perplexity and reduced file size above Q4, making it ideal for most applications.
- **Non-recommended Quantizations:**
  - **IQ3_M:** Represents the best of the I-quantization types below Q4, achieving good size efficiency while maintaining low perplexity.
  - **Q3_K_L:** Provides a slightly larger file size (1.8 GB) with an acceptable PPL (16.5067). While it meets the log PPL difference criterion, it is not as well balanced as the recommended quantizations.
  - **Q6_K:** Delivers nearly lossless performance compared to bf16 with a reduced file size (2.4 GB vs. 4.2 GB). Ideal for scenarios requiring maximum accuracy with some size savings.
- An attempt was made to get a model below **IQ3_M** size, but perplexity was unacceptable even with **IQ2_M** (above the 0.3 selection criterion; see the next section). If you need a model below 1.7 GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf), which appear to be static quantizations rather than importance-matrix ones, so they are smaller.

---

- **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
- **No Multiple Models Within 100 MB of the Same File Size:** Only one model is included per similar file-size range to avoid redundancy. For example, **Q3_K_L** (1.8 GB) is included while other models like **Q3_K_M** (1.7 GB) are excluded due to nearly equal file sizes and differing PPL, ensuring a sparse yet comprehensive selection.

PPL is measured on a sample of 50 texts per language, drawn from the same dataset used to calculate the importance matrix.
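
A sketch of how such a measurement might be run against one of the GGUF files with llama.cpp's perplexity tool is shown below; the file names are placeholders, and only the standard `-m` (model) and `-f` (text file) options are assumed.

```python
import subprocess

# Hypothetical paths: one of the quantized models and an evaluation file built
# from 50 samples per language (same source corpus as the importance matrix).
model = "salamandra-2b-instruct_Q4_K_S.gguf"
eval_file = "ppl-eval-50-per-language.txt"

# llama.cpp's llama-perplexity prints running and final PPL estimates to the
# console; -m selects the model and -f the raw text file to evaluate.
subprocess.run(["llama-perplexity", "-m", model, "-f", eval_file], check=True)
```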

---

# Comparison of salamandra 2b/instruct quantization results