updated readme

- quantization summary
- noted tooling used in quantization
- updated recommendations
- added note on PPL measurement

README.md
CHANGED
@@ -47,9 +47,10 @@ source repo: [BSC-LT/salamandra](/BSC-LT/salamandra-2b)
 
 # **Quantization Summary**
 
-The base model was quantized with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset.
+The base model was quantized in [llama.cpp](https://github.com/ggerganov/llama.cpp) with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset. Logs of the process are included.
 
-- **
+- **IQ3_M**: At <1.8GB, the smallest model worth highlighting.
+- **Q4_K_S**: Good size reduction with minimal PPL impact.
 - **Q5_K_M**: Excellent balance above **Q4**, recommended for most applications.
 - **Q6_K**: Provides near-**bf16** performance with size savings.
 
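For context, this is roughly how an importance-matrix quantization is produced with llama.cpp's tools. It is a minimal sketch under assumptions: the file names and flags below are placeholders, not the exact commands used for this repo (the included logs are the authoritative record).

```bash
# Sketch of an imatrix-based quantization with llama.cpp (assumed workflow).
# Paths and file names are placeholders, not those used for this repo.

# 1. Export the locally downloaded HF checkpoint to a bf16 GGUF baseline.
python convert_hf_to_gguf.py ./salamandra-2b --outtype bf16 \
  --outfile salamandra-2b_bf16.gguf

# 2. Build the importance matrix from the multilingual calibration text
#    (e.g. the ~96MB of OSCAR samples concatenated into one file).
./llama-imatrix -m salamandra-2b_bf16.gguf -f oscar_calibration.txt -o imatrix.dat

# 3. Quantize once per target type, reusing the same importance matrix.
for type in IQ3_M Q3_K_L Q4_K_S Q5_K_M Q6_K; do
  ./llama-quantize --imatrix imatrix.dat \
    salamandra-2b_bf16.gguf "salamandra-2b_${type}.gguf" "$type"
done
```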
@@ -63,8 +64,7 @@ The base model was quantized with a substantive importance matrix over all targe
 |-----------------------|------------|--------------------------|---------------|----------------------------------------------------------------|
 | [**IQ3_M**](salamandra-2b_IQ3_M.gguf) | 15.1995 | 0.079131 | 1.7G | Good size efficiency with acceptable PPL increase |
 | [**Q3_K_L**](salamandra-2b_Q3_K_L.gguf) | 15.0444 | 0.068875 | 1.8G | Further size reduction with modest PPL increase |
-| [**
-| [**Q4_K_M**](salamandra-2b_Q4_K_M.gguf) | 14.399 | 0.025028 | 2.0G | Smaller with acceptable PPL |
+| [**Q4_K_S**](salamandra-2b_Q4_K_S.gguf) | 14.4338 | 0.027442 | 1.9G | Good size reduction with minimal PPL impact (**recommended**) |
 | [**Q5_K_M**](salamandra-2b_Q5_K_M.gguf) | 14.1299 | 0.006162 | 2.2G | Excellent balance of PPL and size (**recommended**) |
 | [**Q6_K**](salamandra-2b_Q6_K.gguf) | 14.0675 | 0.001736 | 2.4G | Nearly lossless performance with reduced size |
 | [**bf16**](salamandra-2b_bf16.gguf) | 14.0431 | 0.0 | 4.2G | Baseline |
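The table does not spell out how the "Log PPL diff" column is computed, but the values match the natural-log difference from the bf16 baseline. For IQ3_M, for example:

$$
\ln(\text{PPL}_{\text{IQ3\_M}}) - \ln(\text{PPL}_{\text{bf16}}) = \ln(15.1995) - \ln(14.0431) \approx 0.0791
$$

which agrees with the 0.079131 reported above, and is the quantity the <0.3 selection criterion (next section) is applied to.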
@@ -74,13 +74,11 @@ The base model was quantized with a substantive importance matrix over all targe
 ### **Notes:**
 
 - **Recommended Quantizations:**
-  - **
+  - **Q4_K_S**: Represents the best of the quantization types at/below **Q4** and less than 2GB, achieving good size efficiency while maintaining low perplexity.
   - **Q5_K_M**: Offers the best balance between low perplexity and reduced file size above **Q4**, making it ideal for most applications.
-  - **Q6_K**: Delivers nearly lossless performance compared to **bf16** with a reduced file size (2.4G vs. 4.2G). Ideal for scenarios requiring maximum accuracy with some size savings.
 - **Non-recommended Quantizations:**
-  - **IQ3_M**: Offers a smaller file size (1.7G) with an acceptable PPL increase
-  - **Q3_K_L**: Provides a slightly larger file size (1.8G) with an even better PPL
-  - **Q4_K_M**: While the **Q4_K_M** model is not designated as "recommended", it is highly suitable for architectures like **Metal**, which run **I-quant** models slowly. For such architectures, **Q4_K_M** remains an excellent choice.
+  - **IQ3_M**: Offers a smaller file size (1.7G) with an acceptable PPL increase, the best among models below 1.8GB. A solid choice among the highly compressed models.
+  - **Q3_K_L**: Provides a slightly larger file size (1.8G) than IQ3_M, with an even better PPL.
 - **Q6_K**: Similar to Q8_0, it offers perplexity very close to bf16. Given its smaller file size than Q8_0 (2.4G vs. 2.7G), Q6_K provides a better size-to-performance trade-off. It was selected because it is nearly lossless and under 2.5GB.
 - An attempt was made to get a model below 1.5GB using **IQ2_XS**, but it came out slightly above that size and its perplexity was clearly unacceptable (more than double the 0.3 selection criterion; see next section). If you need a model below 1.7GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-gguf), which seem to be static quantizations (no importance matrix), so they are smaller.
 
@@ -91,7 +89,7 @@ The base model was quantized with a substantive importance matrix over all targe
 The selection of recommended models is designed to provide a spectrum of options that meet the following criteria:
 
 - **Diversity in Quantization Types:**
-  - **I Quantization Below Q4:** **
+  - **I Quantization Below Q4:** **IQ3_M** is included to offer an option that uses I quantization below the **Q4** level, balancing size and performance.
   - **K Quantization At and Above Q4:** **Q4_K_M**, **Q5_K_M**, and **Q6_K** provide K quantization options at and above the **Q4** level, giving users choices based on their specific needs.
   - **Highly Compressed Quantization (Q3 and below):** **IQ3_M** and **Q3_K_L** are included as they meet the selection criteria of log PPL diff <0.3 and are not redundant with other models.
 
@@ -99,6 +97,8 @@ The selection of recommended models is designed to provide a spectrum of options
 - **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
 - **No Multiple Models Within 100MB of the Same File Size:** Only one model is included per similar file size range to avoid redundancy. For example, **Q3_K_L** (1.8G) is included while other models like **IQ3_XS** (1.7G) are excluded due to overlapping file sizes and comparable PPL, ensuring a sparse yet comprehensive selection.
 
+PPL is measured (with `llama-perplexity`) on a sample of 50 texts per language, drawn from the same dataset used to calculate the importance matrix.
+
 
 ![](./images/salamandra_header.png)
 
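To make that measurement concrete, here is a sketch of the kind of `llama-perplexity` run the added note describes. The sample file name and the loop over the repo's GGUF files are assumptions; the exact flags used for this repo may differ.

```bash
# Sketch: perplexity over a held-out multilingual sample (assumed file name).
# ppl_sample.txt stands in for the 50-texts-per-language evaluation file.
for model in salamandra-2b_*.gguf; do
  echo "== $model =="
  ./llama-perplexity -m "$model" -f ppl_sample.txt
done
```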