TheBloke committed
Commit 6c7d755
Parent: 6bb6f2b

Initial GGML model commit

Files changed (1):
  1. README.md +12 -14
README.md CHANGED
@@ -38,13 +38,11 @@ quantized_by: TheBloke
 
  This repo contains GGML format model files for [Stability AI's StableBeluga 2](https://huggingface.co/stabilityai/StableBeluga2).
 
- GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
- * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI. Supports NVidia CUDA GPU acceleration.
- * [KoboldCpp](https://github.com/LostRuins/koboldcpp), a powerful GGML web UI with GPU acceleration on all platforms (CUDA and OpenCL). Especially good for storytelling.
- * [LM Studio](https://lmstudio.ai/), a fully featured local GUI with GPU acceleration on both Windows (NVidia and AMD) and macOS.
- * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with CUDA GPU acceleration via the c_transformers backend.
- * [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
+ These 70B Llama 2 GGML files currently only support CPU inference. They are known to work with:
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI.
+ * [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later. A powerful GGML web UI with GPU acceleration on all platforms (CUDA and OpenCL). Especially good for storytelling.
+ * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), version 0.1.77 and later. A Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
 
  ## Repositories available
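The new intro in the hunk above points users at llama-cpp-python 0.1.77 and later for these CPU-only files. As a rough sketch of what that looks like in Python, assuming the q4_K_M file from this repo has already been downloaded into the working directory and that your installed version exposes the `n_gqa` parameter (llama-cpp-python's counterpart to llama.cpp's `-gqa 8`):

```
# Minimal sketch, not an official example: CPU-only inference with
# llama-cpp-python >= 0.1.77. Assumes stablebeluga2.ggmlv3.q4_K_M.bin
# has already been downloaded from this repo into the current directory.
from llama_cpp import Llama

llm = Llama(
    model_path="./stablebeluga2.ggmlv3.q4_K_M.bin",
    n_ctx=4096,    # Llama 2 context length
    n_gqa=8,       # grouped-query attention factor; required for 70B Llama 2
    n_threads=10,  # set to your number of physical CPU cores
)

prompt = (
    "### System:\nThis is a system prompt, please behave and help the user.\n\n"
    "### User:\nWrite a story about llamas\n\n"
    "### Assistant:"
)
output = llm(prompt, max_tokens=512, temperature=0.7, repeat_penalty=1.1)
print(output["choices"][0]["text"])
```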
 
@@ -67,15 +65,15 @@ This is a system prompt, please behave and help the user.
  <!-- compatibility_ggml start -->
  ## Compatibility
 
- ### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`
+ ### Requires llama.cpp [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
 
- These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.
+ Or one of the other tools and libraries listed above.
 
- ### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
+ There is currently no GPU acceleration; only CPU inference is available.
 
- These new quantisation methods are compatible with llama.cpp as of June 6th, commit `2d43387`.
+ To use these files in llama.cpp, you must add the `-gqa 8` argument.
 
- They are now also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers and most others. For compatibility with other tools and libraries, please check their documentation.
+ For other UIs and libraries, please check their documentation.
 
  ## Explanation of the new k-quant methods
  <details>
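The hunk above pins a minimum llama.cpp commit. One way to confirm a local checkout is new enough is to ask git whether `e76d630` is an ancestor of HEAD; the helper below is a hypothetical convenience, not part of this repo or of llama.cpp:

```
# Hypothetical helper: check whether a local llama.cpp checkout contains
# the required commit e76d630 (i.e. whether it is an ancestor of HEAD).
import subprocess

REQUIRED_COMMIT = "e76d630df17e235e6b9ef416c45996765d2e36fb"

def has_required_commit(repo_dir: str) -> bool:
    # `git merge-base --is-ancestor A HEAD` exits 0 iff A is an ancestor of HEAD.
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", REQUIRED_COMMIT, "HEAD"]
    )
    return result.returncode == 0

if __name__ == "__main__":
    print(has_required_commit("./llama.cpp"))
```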
@@ -115,11 +113,11 @@ Refer to the Provided Files table below to see what files use which methods, and
  I use the following command line; adjust for your tastes and needs:
 
  ```
- ./main -t 10 -ngl 32 -m stablebeluga2.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
+ ./main -t 10 -gqa 8 -m stablebeluga2.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### System:\nThis is a system prompt, please behave and help the user.\n\n### User:\nWrite a story about llamas\n\n### Assistant:"
  ```
  Change `-t 10` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
 
- Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
+ Remember the `-gqa 8` argument; it is required for Llama 2 70B models.
 
  If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
 
 
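The updated example command hard-codes the StableBeluga prompt format into `-p`. If you are generating prompts from a script instead, a small helper that reproduces the same `### System:` / `### User:` / `### Assistant:` layout may be convenient; this is a sketch of the template shown in the command above, not an official utility:

```
# Sketch of a prompt builder for the StableBeluga 2 template used in the
# example command above. Hypothetical helper, shown only to make the
# template explicit.
DEFAULT_SYSTEM = "This is a system prompt, please behave and help the user."

def build_prompt(user_message: str, system_prompt: str = DEFAULT_SYSTEM) -> str:
    return (
        f"### System:\n{system_prompt}\n\n"
        f"### User:\n{user_message}\n\n"
        "### Assistant:"
    )

print(build_prompt("Write a story about llamas"))
```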