Update README.md
README.md
CHANGED
@@ -35,13 +35,22 @@ tags:
 
 This repo contains GGML format model files for [Meta's Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b).
 
-
-
-
-
-
-
-
+## Only compatible with latest llama.cpp
+
+To use these files you need:
+
+1. llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
+   - For users who don't want to compile from source, you can use the binaries from [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630)
+2. to add the new command line parameter `-gqa 8`
+
+Example command:
+```
+/workspace/git/llama.cpp/main -m llama-2-70b-chat/ggml/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
+```
+
+There is no CUDA support at this time, but it should be coming soon.
+
+There is no support in third-party UIs or Python libraries (llama-cpp-python, ctransformers) yet. That will come in due course.
 
 ## Repositories available
 
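A minimal sketch of steps 1 and 2 above, assuming a Linux machine with `git` and `make`; the model path and quant filename are illustrative, so point `-m` at whichever file you downloaded:

```
# Sketch only: build llama.cpp at (or after) the required commit, then run a 70B GGML file
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630df17e235e6b9ef416c45996765d2e36fb   # or any later commit
make
./main -m /path/to/llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
```

The pre-built binaries from the release linked above can be used instead of compiling, in which case only the final `./main` line applies.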
@@ -58,15 +67,11 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-###
-
-These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.
-
-### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
+### Only compatible with llama.cpp as of commit `e76d630`
 
-
+Compatible with llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
 
-
+For a pre-compiled release, use [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630) or later.
 
 ## Explanation of the new k-quant methods
 <details>
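If you already have a llama.cpp checkout, a quick way to confirm it contains commit `e76d630`, assuming a git clone rather than a release download:

```
# Exits 0 and prints "new enough" if HEAD already includes the required commit
cd llama.cpp
git merge-base --is-ancestor e76d630df17e235e6b9ef416c45996765d2e36fb HEAD && echo "new enough"
```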
@@ -106,17 +111,11 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
-./main -
+./main -m llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
 ```
-Change `-t
-
-Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
-
-If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
-
-## How to run in `text-generation-webui`
+Change `-t 13` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
 
-
+No GPU support is possible yet, but it is coming soon.
 
 <!-- footer start -->
 ## Discord
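For picking the `-t` value mentioned above, one way to count physical cores on Linux, assuming `lscpu` is available:

```
# Counts unique (core, socket) pairs, i.e. physical cores rather than logical threads
lscpu -p=core,socket | grep -v '^#' | sort -u | wc -l
```

On an 8-core/16-thread system this prints 8, matching the `-t 8` example above.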