jartine committed on
Commit 0f41265
1 Parent(s): 999478d

Update README.md

Files changed (1)
  1. README.md +124 -51
README.md CHANGED
@@ -25,78 +25,136 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.

  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
- easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD, and
- NetBSD for AMD64 and ARM64.

- ## License

- The llamafile software is open source and permissively licensed. However
- the weights embedded inside the llamafiles are governed by Google's
- Gemma License and Gemma Prohibited Use Policy. This is not an open
- source license. It's about as restrictive as it gets. There's a great
- many things you're not allowed to do with Gemma. The terms of the
- license and its list of unacceptable uses can be changed by Google at
- any time. Therefore we wouldn't recommend using these llamafiles for
- anything other than evaluating the quality of Google's engineering.

- See the [LICENSE](LICENSE) file for further details.

- ## Quickstart

- Running the following on a desktop OS will launch a tab in your web
- browser with a chatbot interface.

  ```
- wget https://huggingface.co/jartine/gemma-2-27b-it-llamafile/resolve/main/gemma-2-27b-it.Q6_K.llamafile
- chmod +x gemma-2-27b-it.Q6_K.llamafile
- ./gemma-2-27b-it.Q6_K.llamafile
  ```

- You then need to fill out the prompt / history template (see below).

- This model has a max context window size of 8k tokens. By default, a
- context window size of 512 tokens is used. You may increase this to the
- maximum by passing the `-c 0` flag.

- On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
- the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
- driver needs to be installed. If the prebuilt DSOs should fail, the CUDA
- or ROCm SDKs may need to be installed, in which case llamafile builds a
- native module just for your system.

  For further information, please see the [llamafile
  README](https://github.com/mozilla-ocho/llamafile/).

  Having **trouble?** See the ["Gotchas"
- section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
  of the README.
 
- ## Prompting

- When using the browser GUI, you need to fill out the following fields.

- Prompt template (note: this is for chat; Gemma doesn't have a system role):

  ```
- {{history}}
- <start_of_turn>{{char}}
  ```

- History template:

- ```
- <start_of_turn>{{name}}
- {{message}}<end_of_turn>
- ```

- Here's an example of how to prompt Gemma v2 on the command line:

- ```
- ./gemma-2-27b-it.Q6_K.llamafile --special -p '<start_of_turn>user
- The Belobog Academy has discovered a new, invasive species of algae that can double itself in one day, and in 30 days fills a whole reservoir - contaminating the water supply. How many days would it take for the algae to fill half of the reservoir?<end_of_turn>
- <start_of_turn>model
- '
- ```

  ## About Upload Limits

@@ -106,18 +164,33 @@ into a single file, using the same order.

  ## About llamafile

- llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
- It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.

  ## About Quantization Formats

  This model works well with any quantization format. Q6\_K is the best
- choice overall. We tested that it's able to produce identical responses
- to the Gemma2 27B model that's hosted by Google themselves on
- aistudio.google.com. If you encounter any divergences, then try using
- the BF16 weights, which have the original fidelity.

  ---


  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
+ easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD 7.3,
+ and NetBSD for AMD64 and ARM64.

+ *Software Last Updated: 2024-10-30*

+ ## Quickstart

+ To get started, you need both the Gemma weights and the llamafile
+ software. Both of them are included in a single file, which can be
+ downloaded and run as follows:

+ ```
+ wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
+ chmod +x gemma-2-9b-it.Q6_K.llamafile
+ ./gemma-2-9b-it.Q6_K.llamafile
+ ```
+
+ The default mode of operation for these llamafiles is our new command
+ line chatbot interface.
+
+ ![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)
+
+ Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
+ of the README.
+
+ ## Usage
+
+ By default, llamafile launches a chatbot in the terminal and a server
+ in the background. The chatbot is mostly self-explanatory. You can type
+ `/help` for further details. See the [llamafile v0.8.15 release
+ notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
+ for documentation on our newest chatbot features.

+ To instruct Gemma to role-play, you can customize the system
+ prompt as follows:

  ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
  ```

+ To view the man page, run:

+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --help
+ ```

+ To send a request to the OpenAI API-compatible llamafile server, try:
+
+ ```
+ curl http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "gemma-9b-it",
+     "messages": [{"role": "user", "content": "Say this is a test!"}],
+     "temperature": 0.0
+   }'
+ ```
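+
+ The server replies with an OpenAI-style chat completion object. As a
+ minimal sketch (assuming `jq` is installed), the reply text can be
+ extracted like this:
+
+ ```
+ curl -s http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model": "gemma-9b-it", "messages": [{"role": "user", "content": "Say this is a test!"}]}' \
+   | jq -r '.choices[0].message.content'
+ ```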
+
+ If you don't want the chatbot and you only want to run the server:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
+ ```
+
+ An advanced CLI mode is provided that's useful for shell scripting. You
+ can use it by passing the `--cli` flag. For additional help on how it
+ may be used, pass the `--help` flag.
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
+ ```
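+
+ As a minimal sketch of scripting with this mode (the `summary` variable
+ name is just illustrative):
+
+ ```
+ summary="$(./gemma-2-9b-it.Q6_K.llamafile --cli --log-disable \
+   -p 'Summarize in one sentence: why is the sky blue?')"
+ echo "$summary"
+ ```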
+
+ You then need to fill out the prompt / history template (see below).

  For further information, please see the [llamafile
  README](https://github.com/mozilla-ocho/llamafile/).

+ ## Troubleshooting
+
  Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
  of the README.

+ On Linux, the way to avoid run-detector errors is to install the APE
+ interpreter.

+ ```sh
+ sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
+ sudo chmod +x /usr/bin/ape
+ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ ```

+ On Windows there's a 4GB limit on executable sizes. This means you
+ should download the Q2\_K llamafile. For better quality, consider
+ instead downloading the official llamafile release binary from
+ <https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
+ have the .exe file extension, and then saying:

  ```
+ .\llamafile-0.8.15.exe -m gemma-2-9b-it.Q6_K.llamafile
  ```

+ That will overcome the Windows 4GB file size limit, allowing you to
+ benefit from bigger, better models.

+ ## Context Window

+ This model has a max context window size of 8192 tokens (8k), which is
+ also the size used by default. You may limit the context window size by
+ passing the `-c N` flag.
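+
+ For example, a sketch of capping the context window at 4096 tokens in
+ chat mode:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -c 4096
+ ```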

+ ## GPU Acceleration
+
+ On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
+ the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
+ driver needs to be installed if you own an NVIDIA GPU; if you have an
+ AMD GPU, you should install the ROCm SDK v6.1 and then pass the flags
+ `--recompile --gpu amd` the first time you run your llamafile.
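+
+ For example, a sketch of running in chat mode with full GPU offload,
+ using the flags described above:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999
+ ```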
+
+ On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
+ perform matrix multiplications. This is open source software, but it
+ doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
+ installed on your system, then you can pass the `--recompile` flag to
+ build a GGML CUDA library just for your system that uses cuBLAS. This
+ ensures you get maximum performance.
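+
+ As a sketch, a one-time rebuild against an installed CUDA SDK, combined
+ with GPU offload, might look like:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --recompile -ngl 999
+ ```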
+
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).

  ## About Upload Limits

  ## About llamafile

+ llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
+ uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.

  ## About Quantization Formats

  This model works well with any quantization format. Q6\_K is the best
+ choice overall here. We tested, with [our 27b Gemma2
+ llamafiles](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile),
+ that the llamafile implementation of Gemma2 is able to produce
+ identical responses to the Gemma2 model that's hosted by Google on
+ aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
+ faithful to Google's intentions. If you encounter any divergences, then
+ try using the BF16 weights, which have the original fidelity.
+
+ ## See Also
+
+ - <https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile>
+ - <https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile>
+
+ ## License
+
+ The llamafile software is open source and permissively licensed.
+ However, the weights embedded inside the llamafiles are governed by
+ Google's Gemma License and Gemma Prohibited Use Policy. See the
+ [LICENSE](LICENSE) file for further details.

  ---