updated readme
- noted tooling used to quantize models
- updated summary
- revised recommendations
- added detail of how PPL was calculated.
README.md CHANGED
---
source repo: [BSC-LT/salamandra-2b-instruct](https://huggingface.co/BSC-LT/salamandra-2b-instruct)

# **Quantization summary**

The base model was quantized in [llama.cpp](https://github.com/ggerganov/llama.cpp) with a substantial importance matrix covering all target languages (some 34x1000 samples, 96 MB of text), drawn from the [Open Super-large Crawled ALMAnaCH coRpus](/datasets/oscar-corpus/oscar) dataset. Logs of the process are included.
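
For context, the sketch below shows one way such a multilingual calibration file could be assembled with the Hugging Face `datasets` library; the language list, sample counts, and the `unshuffled_deduplicated_*` configuration names are assumptions for illustration, not a record of the actual commands used.

```python
# Hypothetical sketch: build a multilingual calibration file from OSCAR for an
# importance matrix. The real run used roughly 34 languages x 1000 samples
# (~96 MB of text); only a few languages are listed here.
from datasets import load_dataset

languages = ["ca", "es", "en", "de", "fr"]  # extend to all target languages
samples_per_language = 1000

with open("imatrix-calibration.txt", "w", encoding="utf-8") as out:
    for lang in languages:
        # Streaming avoids downloading each full OSCAR split.
        ds = load_dataset(
            "oscar-corpus/oscar",
            f"unshuffled_deduplicated_{lang}",
            split="train",
            streaming=True,
            trust_remote_code=True,
        )
        for i, row in enumerate(ds):
            if i >= samples_per_language:
                break
            out.write(row["text"].strip() + "\n\n")
```

In current llama.cpp builds, a file like this would be fed to `llama-imatrix` to produce the importance matrix, which `llama-quantize` then consumes via its `--imatrix` option.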

- **IQ3_M**: At <1.8 GB, the smallest model worth highlighting.
- **IQ4_XS** or **Q4_K_S**: It's a toss-up for the sub-2 GB quantizations. Metal users will get more t/s from Q4_K_S.
- **Q5_K_M**: Excellent balance above **Q4**, recommended for most applications.
- **Q6_K**: Provides near-**bf16** performance with size savings.

---

# Quantization

| **Quantization Type** | **PPL(Q)** | **ln(PPL(Q)/PPL(bf16))** | **File Size (GB)** | **Notes** |
|-----------------------|------------|--------------------------|--------------------|---------------------------------------------------------------------|
| [**IQ3_M**](salamandra-2b-instruct_IQ3_M.gguf) | 16.774 | 0.086769 | 1.7 | Good size efficiency with acceptable PPL increase |
| [**Q3_K_L**](salamandra-2b-instruct_Q3_K_L.gguf) | 16.5067 | 0.070705 | 1.8 | Further size reduction with modest PPL increase |
| [**IQ4_XS**](salamandra-2b-instruct_IQ4_XS.gguf) | 15.9591 | 0.036968 | 1.8 | Good size reduction with acceptable PPL increase (**recommended**) |
| [**Q4_K_S**](salamandra-2b-instruct_Q4_K_S.gguf) | 15.9346 | 0.035431 | 1.9 | Good size reduction with minimal PPL impact (**recommended**) |
| [**Q5_K_M**](salamandra-2b-instruct_Q5_K_M.gguf) | 15.4746 | 0.006139 | 2.2 | Excellent balance of PPL and size (**recommended**) |
| [**Q6_K**](salamandra-2b-instruct_Q6_K.gguf) | 15.3961 | 0.001053 | 2.4 | Nearly lossless performance with reduced size |
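
The log-ratio column is simply the natural logarithm of each quantization's perplexity divided by the bf16 reference perplexity. The bf16 value itself is not shown in this excerpt, so the sketch below back-calculates it from the Q6_K row; treat it as illustrative only.

```python
import math

# Reference bf16 perplexity, back-calculated from the Q6_K row:
# 15.3961 / exp(0.001053) ~= 15.3799 (illustrative; not quoted in this excerpt).
PPL_BF16 = 15.3799

# PPL(Q) values from the table above.
quants = {
    "IQ3_M": 16.7740,
    "Q3_K_L": 16.5067,
    "IQ4_XS": 15.9591,
    "Q4_K_S": 15.9346,
    "Q5_K_M": 15.4746,
    "Q6_K": 15.3961,
}

for name, ppl in quants.items():
    log_diff = math.log(ppl / PPL_BF16)  # ln(PPL(Q) / PPL(bf16))
    # The selection criterion described below requires log_diff < 0.3.
    print(f"{name:8s} {log_diff:.6f}")
```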

### **Notes:**

- **Recommended Quantizations:**
  - **IQ4_XS:** A good size reduction with minimal PPL impact. The file size is actually very close to 1.9 GB, so not much different from Q4_K_S.
  - **Q4_K_S:** A good size reduction with minimal PPL impact.
  - **Q5_K_M:** Offers the best balance between low perplexity and reduced file size above Q4, making it ideal for most applications.
- **Non-recommended Quantizations:**
  - **IQ3_M:** Represents the best of the I-quantization types below Q4, achieving good size efficiency while maintaining low perplexity.
  - **Q3_K_L:** Provides a slightly larger file size (1.8 GB) with an acceptable PPL (16.5067). While it meets the log PPL difference criterion, it is not as well balanced as the recommended quantizations.
  - **Q6_K:** Delivers nearly lossless performance compared to bf16 with a reduced file size (2.4 GB vs. 4.2 GB). Ideal for scenarios requiring maximum accuracy with some size savings.
- An attempt was made to get a model below **IQ3_M** size, but perplexity was unacceptable even with **IQ2_M** (above the 0.3 selection criterion; see the next section). If you need a model below 1.7 GB, you may be better served by Richard Erkhov's [quantizations](https://huggingface.co/RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf), which appear to be static quantizations rather than importance-matrix ones, so they are smaller.

---

- **Log PPL diff <0.3:** All included models have a log PPL difference under 0.3, ensuring that they maintain acceptable performance even when highly quantized.
- **No Multiple Models Within 100 MB of the Same File Size:** Only one model is included per similar file-size range to avoid redundancy. For example, **Q3_K_L** (1.8 GB) is included while other models like **Q3_K_M** (1.7 GB) are excluded due to nearly equal file sizes and differing PPL, ensuring a sparse yet comprehensive selection.

PPL is measured on a sample of 50 texts per language, drawn from the same dataset used to calculate the importance matrix.
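
A sketch of how such a measurement might be run against one of the GGUF files with llama.cpp's perplexity tool is shown below; the file names are placeholders, and only the standard `-m` (model) and `-f` (text file) options are assumed.

```python
import subprocess

# Hypothetical paths: one of the quantized models and an evaluation file built
# from 50 samples per language (same source corpus as the importance matrix).
model = "salamandra-2b-instruct_Q4_K_S.gguf"
eval_file = "ppl-eval-50-per-language.txt"

# llama.cpp's llama-perplexity prints running and final PPL estimates to the
# console; -m selects the model and -f the raw text file to evaluate.
subprocess.run(["llama-perplexity", "-m", model, "-f", eval_file], check=True)
```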

---

# Comparison of salamandra 2b/instruct quantization results