robbiemu committed
Commit 8f4bb6e
Parent: 51b2d92

updated readme


- quantization summary
- noted tooling used in quantization.
- updated recommendations
- added note on ppl measurement

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -47,9 +47,10 @@ source repo: [BSC-LT/salamandra](/BSC-LT/salamandra-2b)
 
  # **Quantization Summary**
 
- The base model was quantized with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset.
+ The base model was quantized in [llama.cpp](https://github.com/ggerganov/llama.cpp) with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset. Logs of the process are included.
 
- - **IQ4_NL**: Best I quantization below **Q4** with minimal PPL impact.
+ - **IQ3_M**: At <1.8GB, the smallest model worth highlighting.
+ - **Q4_K_S**: Good size reduction with minimal PPL impact.
  - **Q5_K_M**: Excellent balance above **Q4**, recommended for most applications.
  - **Q6_K**: Provides near-**bf16** performance with size savings.
 
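For context on the workflow the added line above describes, here is a minimal sketch of an importance-matrix quantization run driven from Python. It is illustrative rather than the author's actual procedure: the file names are placeholders, the quant-type list simply mirrors the types published here, and it assumes current llama.cpp binary names (`llama-imatrix`, `llama-quantize`); older builds ship these as `imatrix` and `quantize`.

```python
# Hypothetical reproduction of the imatrix + quantization flow (paths are placeholders).
import subprocess

BASE = "salamandra-2b_bf16.gguf"     # unquantized GGUF export of the base model
CALIB = "oscar_calibration.txt"      # concatenated OSCAR samples for all target languages
IMATRIX = "salamandra-2b.imatrix"    # importance matrix written by llama-imatrix

# 1. Compute the importance matrix over the calibration text.
subprocess.run(["llama-imatrix", "-m", BASE, "-f", CALIB, "-o", IMATRIX], check=True)

# 2. Quantize to each published type, guided by that importance matrix.
for qtype in ["IQ3_M", "Q3_K_L", "Q4_K_S", "Q5_K_M", "Q6_K"]:
    out = f"salamandra-2b_{qtype}.gguf"
    subprocess.run(["llama-quantize", "--imatrix", IMATRIX, BASE, out, qtype], check=True)
```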
@@ -63,8 +64,7 @@ The base model was quantized with a substantive importance matrix over all targe
  |-----------------------|------------|--------------------------|---------------|----------------------------------------------------------------|
  | [**IQ3_M**](salamandra-2b_IQ3_M.gguf) | 15.1995 | 0.079131 | 1.7G | Good size efficiency with acceptable PPL increase |
  | [**Q3_K_L**](salamandra-2b_Q3_K_L.gguf) | 15.0444 | 0.068875 | 1.8G | Further size reduction with modest PPL increase |
- | [**IQ4_NL**](salamandra-2b_IQ4_NL.gguf) | 14.5534 | 0.035693 | 1.9G | Good size reduction with minimal PPL impact (**recommended**) |
- | [**Q4_K_M**](salamandra-2b_Q4_K_M.gguf) | 14.399 | 0.025028 | 2.0G | Smaller with acceptable PPL |
+ | [**Q4_K_S**](salamandra-2b_Q4_K_S.gguf) | 14.4338 | 0.027442 | 1.9G | Good size reduction with minimal PPL impact (**recommended**) |
  | [**Q5_K_M**](salamandra-2b_Q5_K_M.gguf) | 14.1299 | 0.006162 | 2.2G | Excellent balance of PPL and size (**recommended**) |
  | [**Q6_K**](salamandra-2b_Q6_K.gguf) | 14.0675 | 0.001736 | 2.4G | Nearly lossless performance with reduced size |
  | [**bf16**](salamandra-2b_bf16.gguf) | 14.0431 | 0.0 | 4.2G | Baseline |
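The second numeric column of the table above (the log PPL difference used by the selection criteria) is consistent with the natural log of each model's PPL relative to the bf16 baseline. The snippet below recomputes that column from the table's own values under that assumption.

```python
# Recompute the log PPL difference column from the table, assuming it is
# ln(PPL_quant / PPL_bf16) relative to the bf16 baseline.
import math

BASELINE = 14.0431  # bf16 PPL from the table

ppl = {"IQ3_M": 15.1995, "Q3_K_L": 15.0444, "Q4_K_S": 14.4338,
       "Q5_K_M": 14.1299, "Q6_K": 14.0675}

for name, value in ppl.items():
    log_diff = math.log(value / BASELINE)
    print(f"{name}: {log_diff:.6f}")  # IQ3_M -> 0.079131, matching the table
```

Every listed quantization stays well under the 0.3 cutoff applied in the selection criteria later in the diff.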
@@ -74,13 +74,11 @@ The base model was quantized with a substantive importance matrix over all targe
  ### **Notes:**
 
  - **Recommended Quantizations:**
- - **IQ4_NL**: Represents the best of the I quantization types below **Q4**, achieving good size efficiency while maintaining low perplexity.
+ - **Q4_K_S**: Represents the best of the quantization types at/below **Q4** and less than 2GB, achieving good size efficiency while maintaining low perplexity.
  - **Q5_K_M**: Offers the best balance between low perplexity and reduced file size above **Q4**, making it ideal for most applications.
- - **Q6_K**: Delivers nearly lossless performance compared to **bf16** with a reduced file size (2.4G vs. 4.2G). Ideal for scenarios requiring maximum accuracy with some size savings.
  - **Non-recommended Quantizations:**
- - **IQ3_M**: Offers a smaller file size (1.7G) with an acceptable PPL increase (15.1995). A solid choice of the highly compressed models.
- - **Q3_K_L**: Provides a slightly larger file size (1.8G) with an even better PPL (15.0444). Fits within the selection criteria for highly compressed models with log PPL diff <0.3.
- - **Q4_K_M**: While the **Q4_K_M** model is not designated as "recommended", it is highly suitable for architectures like **Metal**, which run **I-quant** models slowly. For such architectures, **Q4_K_M** remains an excellent choice.
+ - **IQ3_M**: Offers a smaller file size (1.7G) with an acceptable PPL increase, best among models below 1.8GB. A solid choice of the highly compressed models.
+ - **Q3_K_L**: Provides a slightly larger file size (1.8G) than IQ3_M, with an even better PPL.
  - **Q6_K** Similar to Q8_0, offers very close perplexity to bf16. Given its smaller file size than Q8_0 (2.4G vs. 2.7G), Q6_K provides a better size-to-performance trade-off. It was selected because it is nearly lossless and less than 2.5GB.
  - An attempt was made to get a model below 1.5GB, using **IQ2_XS**, but it was slightly above that size and its perplexity was clearly unacceptable (more than double the 0.3 selection criteria; see next section). If you need a model below 1.7GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-gguf), which seem to be a static quantization instead of using an importance matrix, so they are smaller.
 
@@ -91,7 +89,7 @@ The base model was quantized with a substantive importance matrix over all targe
  The selection of recommended models is designed to provide a spectrum of options that meet the following criteria:
 
  - **Diversity in Quantization Types:**
- - **I Quantization Below Q4:** **IQ4_NL** is included to offer an option that uses I quantization below the **Q4** level, balancing size and performance.
+ - **I Quantization Below Q4:** **IQ3_M** is included to offer an option that uses I quantization below the **Q4** level, balancing size and performance.
  - **K Quantization At and Above Q4:** **Q4_K_M**, **Q5_K_M**, and **Q6_K** provide K quantization options at and above the **Q4** level, giving users choices based on their specific needs.
  - **Highly Compressed Quantization (Q3 and below):** **IQ3_M** and **Q3_K_L** are included as they meet the selection criteria of log PPL diff <0.3 and are not redundant with other models.
 
@@ -99,6 +97,8 @@ The selection of recommended models is designed to provide a spectrum of options
  - **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
  - **No Multiple Models Within 100MB of the Same File Size:** Only one model is included per similar file size range to avoid redundancy. For example, **Q3_K_L** (1.8G) is included while other models like **IQ3_XS** (1.7G) are excluded due to overlapping file sizes and comparable PPL, ensuring a sparse yet comprehensive selection.
 
+ PPL is measured (with `llama-perplexity`) on a sample of 50 texts from each language, drawn from the same dataset used to calculate the importance matrix.
+
 
  ![](./images/salamandra_header.png)
 
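The perplexity note added at the end of the diff refers to llama.cpp's `llama-perplexity` tool. The sketch below shows how such a measurement could be run; the evaluation file name is a placeholder, and assembling the 50-texts-per-language sample is assumed to happen when that file is built.

```python
# Hypothetical PPL measurement with llama.cpp's llama-perplexity (paths are placeholders).
import subprocess

EVAL_TEXT = "oscar_eval_50_per_language.txt"  # 50 sampled texts per target language, concatenated

for gguf in ["salamandra-2b_bf16.gguf", "salamandra-2b_Q4_K_S.gguf", "salamandra-2b_Q5_K_M.gguf"]:
    # llama-perplexity reports a final perplexity estimate over the whole file.
    subprocess.run(["llama-perplexity", "-m", gguf, "-f", EVAL_TEXT], check=True)
```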