jartine committed on
Commit 0f41265
1 Parent(s): 999478d

Update README.md

Files changed (1)
  1. README.md +124 -51
README.md CHANGED
@@ -25,78 +25,136 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.

  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
- easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD, and
- NetBSD for AMD64 and ARM64.

- ## License

- The llamafile software is open source and permissively licensed. However
- the weights embedded inside the llamafiles are governed by Google's
- Gemma License and Gemma Prohibited Use Policy. This is not an open
- source license. It's about as restrictive as it gets. There's a great
- many things you're not allowed to do with Gemma. The terms of the
- license and its list of unacceptable uses can be changed by Google at
- any time. Therefore we wouldn't recommend using these llamafiles for
- anything other than evaluating the quality of Google's engineering.

- See the [LICENSE](LICENSE) file for further details.

- ## Quickstart

- Running the following on a desktop OS will launch a tab in your web
- browser with a chatbot interface.

  ```
- wget https://huggingface.co/jartine/gemma-2-27b-it-llamafile/resolve/main/gemma-2-27b-it.Q6_K.llamafile
- chmod +x gemma-2-27b-it.Q6_K.llamafile
- ./gemma-2-27b-it.Q6_K.llamafile
  ```

- You then need to fill out the prompt / history template (see below).

- This model has a max context window size of 8k tokens. By default, a
- context window size of 512 tokens is used. You may increase this to the
- maximum by passing the `-c 0` flag.

- On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
- the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
- driver needs to be installed. If the prebuilt DSOs should fail, the CUDA
- or ROCm SDKs may need to be installed, in which case llamafile builds a
- native module just for your system.

  For further information, please see the [llamafile
  README](https://github.com/mozilla-ocho/llamafile/).

  Having **trouble?** See the ["Gotchas"
- section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
  of the README.
 
- ## Prompting

- When using the browser GUI, you need to fill out the following fields.

- Prompt template (note: this is for chat; Gemma doesn't have a system role):

  ```
- {{history}}
- <start_of_turn>{{char}}
  ```

- History template:

- ```
- <start_of_turn>{{name}}
- {{message}}<end_of_turn>
- ```

- Here's an example of how to prompt Gemma v2 on the command line:

- ```
- ./gemma-2-27b-it.Q6_K.llamafile --special -p '<start_of_turn>user
- The Belobog Academy has discovered a new, invasive species of algae that can double itself in one day, and in 30 days fills a whole reservoir - contaminating the water supply. How many days would it take for the algae to fill half of the reservoir?<end_of_turn>
- <start_of_turn>model
- '
- ```

  ## About Upload Limits

@@ -106,18 +164,33 @@ into a single file, using the same order.

  ## About llamafile

- llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
- It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.

  ## About Quantization Formats

  This model works well with any quantization format. Q6\_K is the best
- choice overall. We tested that it's able to produce identical responses
- to the Gemma2 27B model that's hosted by Google themselves on
- aistudio.google.com. If you encounter any divergences, then try using
- the BF16 weights, which have the original fidelity.

  ---


  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
+ easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD 7.3,
+ and NetBSD for AMD64 and ARM64.

+ *Software Last Updated: 2024-10-30*

+ ## Quickstart

+ To get started, you need both the Gemma weights and the llamafile
+ software. Both of them are included in a single file, which can be
+ downloaded and run as follows:

+ ```
+ wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
+ chmod +x gemma-2-9b-it.Q6_K.llamafile
+ ./gemma-2-9b-it.Q6_K.llamafile
+ ```
+
+ The default mode of operation for these llamafiles is our new command
+ line chatbot interface.
+
+ ![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)
+
+ Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
+ of the README.
+
+ ## Usage
+
+ By default, llamafile launches a chatbot in the terminal and a server
+ in the background. The chatbot is mostly self-explanatory. You can type
+ `/help` for further details. See the [llamafile v0.8.15 release
+ notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
+ for documentation on our newest chatbot features.

+ To instruct Gemma to role-play, you can customize the system
+ prompt as follows:

  ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
  ```

+ To view the man page, run:

+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --help
+ ```

+ To send a request to the OpenAI API-compatible llamafile server, try:
+
+ ```
+ curl http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "gemma-9b-it",
+     "messages": [{"role": "user", "content": "Say this is a test!"}],
+     "temperature": 0.0
+   }'
+ ```
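+
+ The server replies with an OpenAI-style chat completion object. As a
+ minimal sketch (assuming `jq` is installed), the reply text can be
+ extracted like this:
+
+ ```
+ curl -s http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model": "gemma-9b-it", "messages": [{"role": "user", "content": "Say this is a test!"}]}' \
+   | jq -r '.choices[0].message.content'
+ ```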
+
+ If you don't want the chatbot and you only want to run the server:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
+ ```
+
+ An advanced CLI mode is provided that's useful for shell scripting. You
+ can use it by passing the `--cli` flag. For additional help on how it
+ may be used, pass the `--help` flag.
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
+ ```
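+
+ As a minimal sketch of scripting with this mode (the `summary` variable
+ name is just illustrative):
+
+ ```
+ summary="$(./gemma-2-9b-it.Q6_K.llamafile --cli --log-disable \
+   -p 'Summarize in one sentence: why is the sky blue?')"
+ echo "$summary"
+ ```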
+
+ You then need to fill out the prompt / history template (see below).

  For further information, please see the [llamafile
  README](https://github.com/mozilla-ocho/llamafile/).

+ ## Troubleshooting
+
  Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
  of the README.

+ On Linux, the way to avoid run-detector errors is to install the APE
+ interpreter.

+ ```sh
+ sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
+ sudo chmod +x /usr/bin/ape
+ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ ```

+ On Windows there's a 4GB limit on executable sizes. This means you
+ should download the Q2\_K llamafile. For better quality, consider
+ instead downloading the official llamafile release binary from
+ <https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
+ have the .exe file extension, and then saying:

  ```
+ .\llamafile-0.8.15.exe -m gemma-2-9b-it.Q6_K.llamafile
  ```

+ That will overcome the Windows 4GB file size limit, allowing you to
+ benefit from bigger, better models.

+ ## Context Window

+ This model has a max context window size of 8192 tokens (8k), which is
+ also the size used by default. You may limit the context window size by
+ passing the `-c N` flag.
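+
+ For example, a sketch of capping the context window at 4096 tokens in
+ chat mode:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -c 4096
+ ```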

+ ## GPU Acceleration
+
+ On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
+ the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
+ driver needs to be installed if you own an NVIDIA GPU; if you have an
+ AMD GPU, you should install the ROCm SDK v6.1 and then pass the flags
+ `--recompile --gpu amd` the first time you run your llamafile.
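+
+ For example, a sketch of running in chat mode with full GPU offload,
+ using the flags described above:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999
+ ```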
+
+ On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
+ perform matrix multiplications. This is open source software, but it
+ doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
+ installed on your system, then you can pass the `--recompile` flag to
+ build a GGML CUDA library just for your system that uses cuBLAS. This
+ ensures you get maximum performance.
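+
+ As a sketch, a one-time rebuild against an installed CUDA SDK, combined
+ with GPU offload, might look like:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --recompile -ngl 999
+ ```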
+
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).

  ## About Upload Limits

  ## About llamafile

+ llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
+ uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.

  ## About Quantization Formats

  This model works well with any quantization format. Q6\_K is the best
+ choice overall here. We tested, with [our 27b Gemma2
+ llamafiles](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile),
+ that the llamafile implementation of Gemma2 is able to produce
+ identical responses to the Gemma2 model that's hosted by Google on
+ aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
+ faithful to Google's intentions. If you encounter any divergences, then
+ try using the BF16 weights, which have the original fidelity.
+
+ ## See Also
+
+ - <https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile>
+ - <https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile>
+
+ ## License
+
+ The llamafile software is open source and permissively licensed.
+ However, the weights embedded inside the llamafiles are governed by
+ Google's Gemma License and Gemma Prohibited Use Policy. See the
+ [LICENSE](LICENSE) file for further details.

  ---