Update README.md
README.md
CHANGED
@@ -35,13 +35,22 @@ tags:
 
 This repo contains GGML format model files for [Meta's Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b).
 
-
-
-
-
-
-
-
+## Only compatible with latest llama.cpp
+
+To use these files you need:
+
+1. llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
+   - For users who don't want to compile from source, you can use the binaries from [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630)
+2. to add the new command line parameter `-gqa 8`
+
+Example command:
+```
+/workspace/git/llama.cpp/main -m llama-2-70b-chat/ggml/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
+```
+
+There is no CUDA support at this time, but it should be coming soon.
+
+There is no support in third-party UIs or Python libraries (llama-cpp-python, ctransformers) yet. That will come in due course.
 
 ## Repositories available
 
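A minimal sketch of steps 1 and 2 above, assuming a Linux machine with `git` and `make`; the model path and quant filename are illustrative, so point `-m` at whichever file you downloaded:

```
# Sketch only: build llama.cpp at (or after) the required commit, then run a 70B GGML file
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630df17e235e6b9ef416c45996765d2e36fb   # or any later commit
make
./main -m /path/to/llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
```

The pre-built binaries from the release linked above can be used instead of compiling, in which case only the final `./main` line applies.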
@@ -58,15 +67,11 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-###
-
-These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.
-
-### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
+### Only compatible with llama.cpp as of commit `e76d630`
 
-
+Compatible with llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
 
-
+For a pre-compiled release, use [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630) or later.
 
 ## Explanation of the new k-quant methods
 <details>
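If you already have a llama.cpp checkout, a quick way to confirm it contains commit `e76d630`, assuming a git clone rather than a release download:

```
# Exits 0 and prints "new enough" if HEAD already includes the required commit
cd llama.cpp
git merge-base --is-ancestor e76d630df17e235e6b9ef416c45996765d2e36fb HEAD && echo "new enough"
```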
@@ -106,17 +111,11 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
-./main -
+./main -m llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
 ```
-Change `-t
-
-Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
-
-If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
-
-## How to run in `text-generation-webui`
+Change `-t 13` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
 
-
+No GPU support is possible yet, but it is coming soon.
 
 <!-- footer start -->
 ## Discord
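For picking the `-t` value mentioned above, one way to count physical cores on Linux, assuming `lscpu` is available:

```
# Counts unique (core, socket) pairs, i.e. physical cores rather than logical threads
lscpu -p=core,socket | grep -v '^#' | sort -u | wc -l
```

On an 8-core/16-thread system this prints 8, matching the `-t 8` example above.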