Update README.md
README.md CHANGED

@@ -25,78 +25,136 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.
The model is packaged into executable weights, which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD,
NetBSD for AMD64 and ARM64.

the weights embedded inside the llamafiles are governed by Google's
Gemma License and Gemma Prohibited Use Policy. This is not an open
source license. It's about as restrictive as it gets. There's a great
many things you're not allowed to do with Gemma. The terms of the
license and its list of unacceptable uses can be changed by Google at
any time. Therefore we wouldn't recommend using these llamafiles for
anything other than evaluating the quality of Google's engineering.

```
chmod +x gemma-2-27b-it.Q6_K.llamafile
./gemma-2-27b-it.Q6_K.llamafile
```

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
of the README.

```
<start_of_turn>{{char}}
```

```
<start_of_turn>{{name}}
{{message}}<end_of_turn>
```

## About Upload Limits

@@ -106,18 +164,33 @@ into a single file, using the same order.

## About llamafile

llamafile is a new format introduced by Mozilla
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

## About Quantization Formats

This model works well with any quantization format. Q6\_K is the best
choice overall. We tested that

---

The model is packaged into executable weights, which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD 7.3,
and NetBSD for AMD64 and ARM64.

*Software Last Updated: 2024-10-30*

## Quickstart

To get started, you need both the Gemma weights and the llamafile
software. Both of them are included in a single file, which can be
downloaded and run as follows:

```
wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
chmod +x gemma-2-9b-it.Q6_K.llamafile
./gemma-2-9b-it.Q6_K.llamafile
```

The default mode of operation for these llamafiles is our new command
line chatbot interface.

![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

## Usage

By default, llamafile launches a chatbot in the terminal, and a server
in the background. The chatbot is mostly self-explanatory. You can type
`/help` for further details. See the [llamafile v0.8.15 release
notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
for documentation on our newest chatbot features.

To instruct Gemma to do role playing, you can customize the system
prompt as follows:

```
./gemma-2-9b-it.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
```

To view the man page, run:

```
./gemma-2-9b-it.Q6_K.llamafile --help
```

To send a request to the OpenAI API compatible llamafile server, try:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-9b-it",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.0
  }'
```
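
The response comes back as a standard OpenAI-style chat completion object. As a small convenience sketch (this assumes you have `jq` installed; it is not part of llamafile), you can pull out just the reply text:

```
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-9b-it",
    "messages": [{"role": "user", "content": "Say this is a test!"}]
  }' | jq -r '.choices[0].message.content'
```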

If you don't want the chatbot and you only want to run the server:

```
./gemma-2-9b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
```

An advanced CLI mode is provided that's useful for shell scripting. You
can use it by passing the `--cli` flag. For additional help on how it
may be used, pass the `--help` flag.

```
./gemma-2-9b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
```

You then need to fill out the prompt / history template (see below).
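
As a minimal sketch of Gemma's turn-based format (the `{{message}}` placeholder is illustrative; each user turn is wrapped in `<start_of_turn>`/`<end_of_turn>` markers and followed by the start of the model's turn):

```
<start_of_turn>user
{{message}}<end_of_turn>
<start_of_turn>model
```
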
For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## Troubleshooting

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

On Linux, the way to avoid run-detector errors is to install the APE
interpreter.

```sh
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
```

On Windows there's a 4GB limit on executable sizes. This means you
should download the Q2\_K llamafile. For better quality, consider
instead downloading the official llamafile release binary from
<https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
have the .exe file extension, and then saying:

```
.\llamafile-0.8.15.exe -m gemma-2-9b-it.Q6_K.llamafile
```

That will overcome the Windows 4GB file size limit, allowing you to
benefit from bigger, better models.

## Context Window

This model has a max context window size of 8k tokens. By default, a
context window size of 8192 tokens is used. You may limit the context
window size by passing the `-c N` flag.
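
For example, to cap the context window at 2048 tokens (an arbitrary illustrative value):

```
./gemma-2-9b-it.Q6_K.llamafile --chat -c 2048
```
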
## GPU Acceleration

On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
the system's NVIDIA or AMD GPU(s). On Windows, if you own an NVIDIA GPU,
only the graphics card driver needs to be installed. If you have an AMD
GPU on Windows, you should install the ROCm SDK v6.1 and then pass the
flags `--recompile --gpu amd` the first time you run your llamafile.
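
For example (assuming the model fits in your GPU's RAM), full offload looks like this; the AMD-on-Windows variant adds the recompile flags described above:

```
# offload all layers to the GPU
./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999

# AMD GPU on Windows: one-time rebuild against the ROCm SDK
./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999 --recompile --gpu amd
```
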
On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
perform matrix multiplications. This is open source software, but it
doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
installed on your system, then you can pass the `--recompile` flag to
build a GGML CUDA library just for your system that uses cuBLAS. This
ensures you get maximum performance.
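
A sketch of that one-time rebuild, assuming the CUDA SDK is installed and on your PATH:

```
./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999 --recompile
```
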

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).
## About Upload Limits
## About llamafile

llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

## About Quantization Formats

This model works well with any quantization format. Q6\_K is the best
choice overall here. We tested, with [our 27b Gemma2
llamafiles](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile),
that the llamafile implementation of Gemma2 is able to produce
responses identical to the Gemma2 model that's hosted by Google on
aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
faithful to Google's intentions. If you encounter any divergences, then
try using the BF16 weights, which have the original fidelity.
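
For instance, if a BF16 build is published in this repository (the filename below is a guess; check the repository's file listing for the exact name), it runs the same way as the quantized builds:

```
# hypothetical filename; substitute the actual BF16 llamafile name
./gemma-2-9b-it.BF16.llamafile --chat
```
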

## See Also

- <https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile>
- <https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile>

## License

The llamafile software is open source and permissively licensed. However,
the weights embedded inside the llamafiles are governed by Google's
Gemma License and Gemma Prohibited Use Policy. See the
[LICENSE](LICENSE) file for further details.
---