TheBloke committed on
Commit
1122b68
1 Parent(s): 70ef6df

Update README.md

Files changed (1)
  1. README.md +23 -23
README.md CHANGED
@@ -47,8 +47,8 @@ These 70B Llama 2 GGML files currently only support CPU inference. They are kno
 
 ## Repositories available
 
- * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/StableBeluga2-GPTQ)
- * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/StableBeluga2-GGML)
+ * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/StableBeluga2-70B-GPTQ)
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/StableBeluga2-70B-GGML)
 * [Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/stabilityai/StableBeluga2)
 
 ## Prompt template: Orca-Hashes
@@ -94,20 +94,20 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
- | [stablebeluga2.ggmlv3.q2_K.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q2_K.bin) | q2_K | 2 | 28.59 GB| 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
- | [stablebeluga2.ggmlv3.q3_K_L.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q3_K_L.bin) | q3_K_L | 3 | 36.15 GB| 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
- | [stablebeluga2.ggmlv3.q3_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q3_K_M.bin) | q3_K_M | 3 | 33.04 GB| 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
- | [stablebeluga2.ggmlv3.q3_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q3_K_S.bin) | q3_K_S | 3 | 29.75 GB| 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
- | [stablebeluga2.ggmlv3.q4_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_0.bin) | q4_0 | 4 | 38.87 GB| 41.37 GB | Original quant method, 4-bit. |
- | [stablebeluga2.ggmlv3.q4_1.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_1.bin) | q4_1 | 4 | 43.17 GB| 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
- | [stablebeluga2.ggmlv3.q4_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_K_M.bin) | q4_K_M | 4 | 41.38 GB| 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
- | [stablebeluga2.ggmlv3.q4_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q4_K_S.bin) | q4_K_S | 4 | 38.87 GB| 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
- | [stablebeluga2.ggmlv3.q5_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q5_0.bin) | q5_0 | 5 | 47.46 GB| 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
- | [stablebeluga2.ggmlv3.q5_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q5_K_M.bin) | q5_K_M | 5 | 48.75 GB| 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
- | [stablebeluga2.ggmlv3.q5_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2.ggmlv3.q5_K_S.bin) | q5_K_S | 5 | 47.46 GB| 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
- | stablebeluga2.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
- | stablebeluga2.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
- | stablebeluga2.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+ | [stablebeluga2-70b.ggmlv3.q2_K.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q2_K.bin) | q2_K | 2 | 28.59 GB| 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
+ | [stablebeluga2-70b.ggmlv3.q3_K_L.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q3_K_L.bin) | q3_K_L | 3 | 36.15 GB| 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+ | [stablebeluga2-70b.ggmlv3.q3_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q3_K_M.bin) | q3_K_M | 3 | 33.04 GB| 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+ | [stablebeluga2-70b.ggmlv3.q3_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q3_K_S.bin) | q3_K_S | 3 | 29.75 GB| 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
+ | [stablebeluga2-70b.ggmlv3.q4_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_0.bin) | q4_0 | 4 | 38.87 GB| 41.37 GB | Original quant method, 4-bit. |
+ | [stablebeluga2-70b.ggmlv3.q4_1.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_1.bin) | q4_1 | 4 | 43.17 GB| 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+ | [stablebeluga2-70b.ggmlv3.q4_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_K_M.bin) | q4_K_M | 4 | 41.38 GB| 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
+ | [stablebeluga2-70b.ggmlv3.q4_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q4_K_S.bin) | q4_K_S | 4 | 38.87 GB| 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
+ | [stablebeluga2-70b.ggmlv3.q5_0.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q5_0.bin) | q5_0 | 5 | 47.46 GB| 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+ | [stablebeluga2-70b.ggmlv3.q5_K_M.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q5_K_M.bin) | q5_K_M | 5 | 48.75 GB| 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
+ | [stablebeluga2-70b.ggmlv3.q5_K_S.bin](https://huggingface.co/TheBloke/StableBeluga2-GGML/blob/main/stablebeluga2-70b.ggmlv3.q5_K_S.bin) | q5_K_S | 5 | 47.46 GB| 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
+ | stablebeluga2-70b.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
+ | stablebeluga2-70b.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
+ | stablebeluga2-70b.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
 
 ### q5_1, q6_K and q8_0 files require expansion from archive
 
@@ -115,23 +115,23 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 ### q5_1
 Please download:
- * `stablebeluga2.ggmlv3.q5_1.zip`
- * `stablebeluga2.ggmlv3.q5_1.z01`
+ * `stablebeluga2-70b.ggmlv3.q5_1.zip`
+ * `stablebeluga2-70b.ggmlv3.q5_1.z01`
 
 ### q6_K
 Please download:
- * `stablebeluga2.ggmlv3.q6_K.zip`
- * `stablebeluga2.ggmlv3.q6_K.z01`
+ * `stablebeluga2-70b.ggmlv3.q6_K.zip`
+ * `stablebeluga2-70b.ggmlv3.q6_K.z01`
 
 ### q8_0
 Please download:
- * `stablebeluga2.ggmlv3.q8_0.zip`
- * `stablebeluga2.ggmlv3.q8_0.z01`
+ * `stablebeluga2-70b.ggmlv3.q8_0.zip`
+ * `stablebeluga2-70b.ggmlv3.q8_0.z01`
 
 Then extract the .zip archive. This will expand both parts automatically. On Linux I found I had to use `7zip` - the basic `unzip` tool did not work. Example:
 ```
 sudo apt update -y && sudo apt install 7zip
- 7zz x stablebeluga2.ggmlv3.q6_K.zip
+ 7zz x stablebeluga2-70b.ggmlv3.q6_K.zip
 ```
 
 Once the `.bin` is extracted you can delete the `.zip` and `.z01` files.
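
For completeness, a sketch of fetching the two parts of a split archive in the first place. This assumes the parts are hosted in the `TheBloke/StableBeluga2-GGML` repository linked in the table above and that Hugging Face's usual `resolve/main` download URL pattern applies; adjust the repository path and file names to whatever actually hosts them.

```
# Hypothetical download of both parts of the q6_K split archive.
# The repository path is an assumption; substitute the repo that actually hosts the files.
wget https://huggingface.co/TheBloke/StableBeluga2-GGML/resolve/main/stablebeluga2-70b.ggmlv3.q6_K.zip
wget https://huggingface.co/TheBloke/StableBeluga2-GGML/resolve/main/stablebeluga2-70b.ggmlv3.q6_K.z01
```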
 
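With a `.bin` in hand, inference runs on CPU via llama.cpp. Below is a minimal sketch of an invocation, assuming a GGMLv3-era llama.cpp build that accepts the `-gqa 8` option (needed for 70B GGML models at the time) and assuming the Orca-Hashes template is laid out as `### System:` / `### User:` / `### Assistant:`; the file name, thread count and prompt are placeholders.

```
# Hypothetical CPU-only run of the q4_K_M file; adjust -t to your physical core count.
./main -m stablebeluga2-70b.ggmlv3.q4_K_M.bin \
  -t 10 -gqa 8 -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --color \
  -p $'### System:\nYou are a helpful assistant.\n\n### User:\nWrite a haiku about belugas.\n\n### Assistant:\n'
```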