robbiemu committed
Commit ca29512
1 Parent(s): e454a4c

updated readme

- noted tooling used to quantize models
- updated summary
- revised recommendations
- added detail of how PPL was calculated

Files changed (1)
  1. README.md +17 -4
README.md CHANGED
@@ -45,14 +45,24 @@ language:
  ---
  source repo: [BSC-LT/salamandra-2b-instruct](https://huggingface.co/BSC-LT/salamandra-2b-instruct)

- # Quantization summary

- The base model was quantized with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) with samples from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset.

  | **Quantization Type** | **PPL(Q)** | **ln(PPL(Q)/PPL(bf16))** | **File Size (G)** | **Notes** |
  |-----------------------|------------|------------------------|-------------------|----------------------------------------------------------------|
  | [**IQ3_M**](salamandra-2b-instruct_IQ3_M.gguf) | 16.774 | 0.086769 | 1.7 | Good size efficiency with acceptable PPL increase |
  | [**Q3_K_L**](salamandra-2b-instruct_Q3_K_L.gguf) | 16.5067 | 0.070705 | 1.8 | Further size reduction with modest PPL increase |
  | [**Q4_K_S**](salamandra-2b-instruct_Q4_K_S.gguf) | 15.9346 | 0.035431 | 1.9 | Good size reduction with minimal PPL impact (**recommended**) |
  | [**Q5_K_M**](salamandra-2b-instruct_Q5_K_M.gguf) | 15.4746 | 0.006139 | 2.2 | Excellent balance of PPL and size (**recommended**) |
  | [**Q6_K**](salamandra-2b-instruct_Q6_K.gguf) | 15.3961 | 0.001053 | 2.4 | Nearly lossless performance with reduced size |
@@ -61,12 +71,13 @@ The base model was quantized with a substantive importance matrix over all targe
  ### **Notes:**

  - **Recommended Quantizations:**
- - **Q4_K_S:** Although it offers good size reduction with minimal PPL impact, it is superseded by more optimal choices like Q5_K_M and Q6_K, but it is the only model with minimal PPL impact below 2GB.
  - **Q5_K_M:** Offers the best balance between low perplexity and reduced file size above Q4, making it ideal for most applications.
- - **Q6_K:** Delivers nearly lossless performance compared to bf16 with a reduced file size (2.4G vs. 4.2G). Ideal for scenarios requiring maximum accuracy with some size savings.
  - **Non-recommended Quantizations:**
  - **IQ3_M:** Represents the best of the I quantization types below Q4, achieving good size efficiency while maintaining low perplexity.
  - **Q3_K_L:** Provides a slightly larger file size (1.8G) with an acceptable PPL (16.5067). While it meets the log PPL difference criteria, it is not as balanced as the recommended quantizations.
  - An attempt was made to get a model below **IQ3_M** size, but perplexity was unacceptable even with **IQ2_M** (more than the 0.3 selection criteria, see next section). If you need a model below 1.7GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf), which seem to be a static quantization instead of using an importance matrix, so they are smaller.

  ---
@@ -84,6 +95,8 @@ The selection of recommended models is designed to provide a spectrum of options
  - **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
  - **No Multiple Models Within 100MB of the Same File Size:** Only one model is included per similar file size range to avoid redundancy. For example, **Q3_K_L** (1.8G) is included while other models like **Q3_K_M** (1.7G) are excluded due to nearly equal file sizes and differing PPL, ensuring a sparse yet comprehensive selection.

  ---

  # Comparison of salamandra 2b/instruct quantization results
 
  ---
  source repo: [BSC-LT/salamandra-2b-instruct](https://huggingface.co/BSC-LT/salamandra-2b-instruct)

+ # **Quantization summary**

+ The base model was quantized in [llama.cpp](https://github.com/ggerganov/llama.cpp) with a substantive importance matrix over all target languages (some 34x1000 samples, 96MB of text) drawn from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset. Logs of the process are included.
+
+ - **IQ3_M**: At <1.8GB, the smallest model worth highlighting.
+ - **IQ4_XS** or **Q4_K_S**: It's a toss-up for the sub-2GB quantizations. Metal users will get more t/s from Q4_K_S.
+ - **Q5_K_M**: Excellent balance above **Q4**, recommended for most applications.
+ - **Q6_K**: Provides near-**bf16** performance with size savings.
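The summary above notes that quantization was done in llama.cpp with an importance matrix. As a point of reference, the typical workflow with llama.cpp's `llama-imatrix` and `llama-quantize` tools looks roughly like the sketch below; the file names and the Python wrapper are illustrative assumptions, not the exact commands used for this repo (the included logs record those).

```python
# Illustrative only: a minimal importance-matrix -> quantize pass with llama.cpp's CLI tools.
# File names/paths are placeholders, not the exact ones used for this repo.
import subprocess

CALIBRATION_TEXT = "oscar_calibration.txt"      # concatenated multilingual samples (placeholder)
F16_MODEL = "salamandra-2b-instruct-f16.gguf"   # unquantized GGUF export (placeholder)

# 1. Compute an importance matrix from the calibration text.
subprocess.run(
    ["llama-imatrix", "-m", F16_MODEL, "-f", CALIBRATION_TEXT, "-o", "imatrix.dat"],
    check=True,
)

# 2. Quantize with the importance matrix for each target type.
for qtype in ["IQ3_M", "Q3_K_L", "IQ4_XS", "Q4_K_S", "Q5_K_M", "Q6_K"]:
    subprocess.run(
        ["llama-quantize", "--imatrix", "imatrix.dat",
         F16_MODEL, f"salamandra-2b-instruct_{qtype}.gguf", qtype],
        check=True,
    )
```

The importance matrix from the first step is what the low-bit IQ types in particular lean on to keep perplexity in check.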
+
+ ---
+
+ # Quantization

  | **Quantization Type** | **PPL(Q)** | **ln(PPL(Q)/PPL(bf16))** | **File Size (G)** | **Notes** |
  |-----------------------|------------|------------------------|-------------------|----------------------------------------------------------------|
  | [**IQ3_M**](salamandra-2b-instruct_IQ3_M.gguf) | 16.774 | 0.086769 | 1.7 | Good size efficiency with acceptable PPL increase |
  | [**Q3_K_L**](salamandra-2b-instruct_Q3_K_L.gguf) | 16.5067 | 0.070705 | 1.8 | Further size reduction with modest PPL increase |
+ | [**IQ4_XS**](salamandra-2b-instruct_IQ4_XS.gguf) | 15.9591 | 0.036968 | 1.8 | Good size reduction with acceptable PPL increase (**recommended**) |
  | [**Q4_K_S**](salamandra-2b-instruct_Q4_K_S.gguf) | 15.9346 | 0.035431 | 1.9 | Good size reduction with minimal PPL impact (**recommended**) |
  | [**Q5_K_M**](salamandra-2b-instruct_Q5_K_M.gguf) | 15.4746 | 0.006139 | 2.2 | Excellent balance of PPL and size (**recommended**) |
  | [**Q6_K**](salamandra-2b-instruct_Q6_K.gguf) | 15.3961 | 0.001053 | 2.4 | Nearly lossless performance with reduced size |
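On reading the third column: it is ln(PPL(Q)/PPL(bf16)), so values near 0 mean the quantization is nearly lossless relative to the bf16 baseline. The baseline PPL is not listed directly, but any row of the table implies it; a quick illustrative check in Python:

```python
# How the log-PPL column is derived: ln(PPL(Q) / PPL(bf16)).
# The bf16 baseline is not listed explicitly; the table's own numbers imply it.
import math

ppl_bf16 = 15.3961 / math.exp(0.001053)   # baseline recovered from the Q6_K row, ~15.38

for name, ppl_q in [("IQ3_M", 16.774), ("Q4_K_S", 15.9346), ("Q6_K", 15.3961)]:
    log_diff = math.log(ppl_q / ppl_bf16)
    print(f"{name}: ln(PPL ratio) = {log_diff:.6f}")  # matches the table column up to rounding
```

Working backwards this way gives PPL(bf16) ≈ 15.38, which is why Q6_K's 15.3961 reads as nearly lossless.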
 
  ### **Notes:**

  - **Recommended Quantizations:**
+ - **IQ4_XS:** A good size reduction with minimal PPL impact. The file size is actually very close to 1.9GB, so not much different from Q4_K_S.
+ - **Q4_K_S:** A good size reduction with minimal PPL impact.
  - **Q5_K_M:** Offers the best balance between low perplexity and reduced file size above Q4, making it ideal for most applications.
  - **Non-recommended Quantizations:**
  - **IQ3_M:** Represents the best of the I quantization types below Q4, achieving good size efficiency while maintaining low perplexity.
  - **Q3_K_L:** Provides a slightly larger file size (1.8G) with an acceptable PPL (16.5067). While it meets the log PPL difference criteria, it is not as balanced as the recommended quantizations.
+ - **Q6_K:** Delivers nearly lossless performance compared to bf16 with a reduced file size (2.4G vs. 4.2G). Ideal for scenarios requiring maximum accuracy with some size savings.
  - An attempt was made to get a model below **IQ3_M** size, but perplexity was unacceptable even with **IQ2_M** (more than the 0.3 selection criteria, see next section). If you need a model below 1.7GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf), which seem to be a static quantization instead of using an importance matrix, so they are smaller.

  ---
 
  - **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
  - **No Multiple Models Within 100MB of the Same File Size:** Only one model is included per similar file size range to avoid redundancy. For example, **Q3_K_L** (1.8G) is included while other models like **Q3_K_M** (1.7G) are excluded due to nearly equal file sizes and differing PPL, ensuring a sparse yet comprehensive selection.

+ PPL is measured on 50 samples per language, drawn from the same dataset used to calculate the importance matrix.
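Perplexity figures like those in the table can be reproduced with llama.cpp's `llama-perplexity` tool over such a sample file. A minimal sketch follows; the file names and the Python wrapper are assumptions for illustration, not the exact evaluation runs (those are in the included logs).

```python
# Illustrative only: measuring PPL of each quantized model with llama.cpp's
# llama-perplexity tool. File names are placeholders.
import subprocess

EVAL_TEXT = "oscar_ppl_sample.txt"  # e.g. 50 samples per language, concatenated

for qtype in ["IQ3_M", "Q3_K_L", "IQ4_XS", "Q4_K_S", "Q5_K_M", "Q6_K"]:
    model = f"salamandra-2b-instruct_{qtype}.gguf"
    # llama-perplexity reports perplexity over chunks of the supplied text file.
    subprocess.run(["llama-perplexity", "-m", model, "-f", EVAL_TEXT], check=True)
```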
+
  ---

  # Comparison of salamandra 2b/instruct quantization results