updated readme

- quantization summary
- noted tooling used in quantization
- updated recommendations
- added note on PPL measurement

README.md
CHANGED
@@ -47,9 +47,10 @@ source repo: [BSC-LT/salamandra](/BSC-LT/salamandra-2b)
 
 # **Quantization Summary**
 
-The base model was quantized with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset.
+The base model was quantized in [llama.cpp](https://github.com/ggerganov/llama.cpp) with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset. Logs of the process are included.
 
-- **
+- **IQ3_M**: At <1.8GB, the smallest model worth highlighting.
+- **Q4_K_S**: Good size reduction with minimal PPL impact.
 - **Q5_K_M**: Excellent balance above **Q4**, recommended for most applications.
 - **Q6_K**: Provides near-**bf16** performance with size savings.
 
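For context, this is roughly how an importance-matrix quantization is produced with llama.cpp's tools. It is a minimal sketch under assumptions: the file names and flags below are placeholders, not the exact commands used for this repo (the included logs are the authoritative record).

```bash
# Sketch of an imatrix-based quantization with llama.cpp (assumed workflow).
# Paths and file names are placeholders, not those used for this repo.

# 1. Export the locally downloaded HF checkpoint to a bf16 GGUF baseline.
python convert_hf_to_gguf.py ./salamandra-2b --outtype bf16 \
  --outfile salamandra-2b_bf16.gguf

# 2. Build the importance matrix from the multilingual calibration text
#    (e.g. the ~96MB of OSCAR samples concatenated into one file).
./llama-imatrix -m salamandra-2b_bf16.gguf -f oscar_calibration.txt -o imatrix.dat

# 3. Quantize once per target type, reusing the same importance matrix.
for type in IQ3_M Q3_K_L Q4_K_S Q5_K_M Q6_K; do
  ./llama-quantize --imatrix imatrix.dat \
    salamandra-2b_bf16.gguf "salamandra-2b_${type}.gguf" "$type"
done
```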
@@ -63,8 +64,7 @@ The base model was quantized with a substantive importance matrix over all targe
 |-----------------------|------------|--------------------------|---------------|----------------------------------------------------------------|
 | [**IQ3_M**](salamandra-2b_IQ3_M.gguf) | 15.1995 | 0.079131 | 1.7G | Good size efficiency with acceptable PPL increase |
 | [**Q3_K_L**](salamandra-2b_Q3_K_L.gguf) | 15.0444 | 0.068875 | 1.8G | Further size reduction with modest PPL increase |
-| [**
-| [**Q4_K_M**](salamandra-2b_Q4_K_M.gguf) | 14.399 | 0.025028 | 2.0G | Smaller with acceptable PPL |
+| [**Q4_K_S**](salamandra-2b_Q4_K_S.gguf) | 14.4338 | 0.027442 | 1.9G | Good size reduction with minimal PPL impact (**recommended**) |
 | [**Q5_K_M**](salamandra-2b_Q5_K_M.gguf) | 14.1299 | 0.006162 | 2.2G | Excellent balance of PPL and size (**recommended**) |
 | [**Q6_K**](salamandra-2b_Q6_K.gguf) | 14.0675 | 0.001736 | 2.4G | Nearly lossless performance with reduced size |
 | [**bf16**](salamandra-2b_bf16.gguf) | 14.0431 | 0.0 | 4.2G | Baseline |
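The table does not spell out how the "Log PPL diff" column is computed, but the values match the natural-log difference from the bf16 baseline. For IQ3_M, for example:

$$
\ln(\text{PPL}_{\text{IQ3\_M}}) - \ln(\text{PPL}_{\text{bf16}}) = \ln(15.1995) - \ln(14.0431) \approx 0.0791
$$

which agrees with the 0.079131 reported above, and is the quantity the <0.3 selection criterion (next section) is applied to.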
@@ -74,13 +74,11 @@ The base model was quantized with a substantive importance matrix over all targe
 ### **Notes:**
 
 - **Recommended Quantizations:**
-  - **
+  - **Q4_K_S**: Represents the best of the quantization types at/below **Q4** and less than 2GB, achieving good size efficiency while maintaining low perplexity.
   - **Q5_K_M**: Offers the best balance between low perplexity and reduced file size above **Q4**, making it ideal for most applications.
-  - **Q6_K**: Delivers nearly lossless performance compared to **bf16** with a reduced file size (2.4G vs. 4.2G). Ideal for scenarios requiring maximum accuracy with some size savings.
 - **Non-recommended Quantizations:**
-  - **IQ3_M**: Offers a smaller file size (1.7G) with an acceptable PPL increase
-  - **Q3_K_L**: Provides a slightly larger file size (1.8G) with an even better PPL
-  - **Q4_K_M**: While the **Q4_K_M** model is not designated as "recommended", it is highly suitable for architectures like **Metal**, which run **I-quant** models slowly. For such architectures, **Q4_K_M** remains an excellent choice.
+  - **IQ3_M**: Offers a smaller file size (1.7G) with an acceptable PPL increase, the best among models below 1.8GB. A solid choice among the highly compressed models.
+  - **Q3_K_L**: Provides a slightly larger file size (1.8G) than IQ3_M, with an even better PPL.
 - **Q6_K**: Similar to Q8_0, it offers perplexity very close to bf16. Given its smaller file size than Q8_0 (2.4G vs. 2.7G), Q6_K provides a better size-to-performance trade-off. It was selected because it is nearly lossless and under 2.5GB.
 - An attempt was made to get a model below 1.5GB using **IQ2_XS**, but it came out slightly above that size and its perplexity was clearly unacceptable (more than double the 0.3 selection criterion; see next section). If you need a model below 1.7GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-gguf), which seem to be static quantizations (no importance matrix), so they are smaller.
 
@@ -91,7 +89,7 @@ The base model was quantized with a substantive importance matrix over all targe
 The selection of recommended models is designed to provide a spectrum of options that meet the following criteria:
 
 - **Diversity in Quantization Types:**
-  - **I Quantization Below Q4:** **
+  - **I Quantization Below Q4:** **IQ3_M** is included to offer an option that uses I quantization below the **Q4** level, balancing size and performance.
   - **K Quantization At and Above Q4:** **Q4_K_M**, **Q5_K_M**, and **Q6_K** provide K quantization options at and above the **Q4** level, giving users choices based on their specific needs.
   - **Highly Compressed Quantization (Q3 and below):** **IQ3_M** and **Q3_K_L** are included as they meet the selection criteria of log PPL diff <0.3 and are not redundant with other models.
 
@@ -99,6 +97,8 @@ The selection of recommended models is designed to provide a spectrum of options
 - **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
 - **No Multiple Models Within 100MB of the Same File Size:** Only one model is included per similar file size range to avoid redundancy. For example, **Q3_K_L** (1.8G) is included while other models like **IQ3_XS** (1.7G) are excluded due to overlapping file sizes and comparable PPL, ensuring a sparse yet comprehensive selection.
 
+PPL is measured (with `llama-perplexity`) on a sample of 50 texts per language, drawn from the same dataset used to calculate the importance matrix.
+
 
 ![](./images/salamandra_header.png)
 
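To make that measurement concrete, here is a sketch of the kind of `llama-perplexity` run the added note describes. The sample file name and the loop over the repo's GGUF files are assumptions; the exact flags used for this repo may differ.

```bash
# Sketch: perplexity over a held-out multilingual sample (assumed file name).
# ppl_sample.txt stands in for the 50-texts-per-language evaluation file.
for model in salamandra-2b_*.gguf; do
  echo "== $model =="
  ./llama-perplexity -m "$model" -f ppl_sample.txt
done
```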