Initial GGML model commit
README.md CHANGED
@@ -38,13 +38,11 @@ quantized_by: TheBloke
 
 This repo contains GGML format model files for [Stability AI's StableBeluga 2](https://huggingface.co/stabilityai/StableBeluga2).
 
-GGML files
-* [
-* [
-* [
-* [
-* [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
-* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
+These 70B Llama 2 GGML files currently only support CPU inference. They are known to work with:
+* [llama.cpp](https://github.com/ggerganov/llama.cpp)
+* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI.
+* [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later. A powerful GGML web UI with GPU acceleration on all platforms (CUDA and OpenCL). Especially good for storytelling.
+* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), version 0.1.77 and later. A Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
 
 ## Repositories available
 
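A minimal llama-cpp-python sketch of the route listed above, assuming version 0.1.77 or later and a downloaded quant file; the filename, thread count, and generation parameters are illustrative, mirroring the llama.cpp example further down:

```python
# Sketch, assuming llama-cpp-python >= 0.1.77 as stated in the list above.
# The model path is a placeholder for whichever quantized file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="stablebeluga2.ggmlv3.q4_K_M.bin",  # placeholder path
    n_ctx=4096,     # context size, matching -c 4096 in the example below
    n_gqa=8,        # grouped-query attention; the equivalent of llama.cpp's -gqa 8
    n_threads=8,    # set to your number of physical CPU cores
)

prompt = (
    "### System:\nThis is a system prompt, please behave and help the user.\n\n"
    "### User:\nWrite a story about llamas\n\n"
    "### Assistant:"
)
output = llm(prompt, max_tokens=512, temperature=0.7, repeat_penalty=1.1)
print(output["choices"][0]["text"])
```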
@@ -67,15 +65,15 @@ This is a system prompt, please behave and help the user.
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-### 
+### Requires llama.cpp [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
 
-
+Or one of the other tools and libraries listed above.
 
-
+There is currently no GPU acceleration; only the CPU can be used.
 
-
+To use these files in llama.cpp, you must add the `-gqa 8` argument.
 
-
+For other UIs and libraries, please check the docs.
 
 ## Explanation of the new k-quant methods
 <details>
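As a sketch of the compatibility requirement above: once llama.cpp has been built from commit `e76d630` or later, the `main` binary can be driven from Python as follows (the paths and quant filename are placeholders):

```python
# Sketch only: invoke a llama.cpp `main` binary built from commit e76d630
# or later. CPU-only, per the compatibility note above.
import subprocess

result = subprocess.run(
    [
        "./main",
        "-t", "8",                                # physical CPU cores
        "-gqa", "8",                              # required for Llama 2 70B GGML files
        "-m", "stablebeluga2.ggmlv3.q4_K_M.bin",  # placeholder filename
        "-p", "Hello",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```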
@@ -115,11 +113,11 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
-./main -t 10 -
+./main -t 10 -gqa 8 -m stablebeluga2.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### System:\nThis is a system prompt, please behave and help the user.\n\n### User:\nWrite a story about llamas\n\n### Assistant:"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
 
-
+Remember the `-gqa 8` argument; it is required for Llama 70B models.
 
 If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
 
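The prompt template embedded in the command above is easy to get wrong, so here is a small helper that reproduces it; the function name is hypothetical, not part of this repo:

```python
# Hypothetical helper that reproduces the StableBeluga 2 prompt template
# used in the example command above.
def build_prompt(system: str, user: str) -> str:
    return f"### System:\n{system}\n\n### User:\n{user}\n\n### Assistant:"

print(build_prompt(
    "This is a system prompt, please behave and help the user.",
    "Write a story about llamas",
))
```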