---
license: gemma
language:
- en
tags:
- conversational
quantized_by: qnixsynapse
---
## Llama.cpp quantizations of the official gemma-2-9b-it GGUF from the Kaggle repo
Using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> PR <a href="https://github.com/ggerganov/llama.cpp/pull/8156">8156</a> for quantization.
Original model: https://huggingface.co/google/gemma-2-9b-it
## Downloading using huggingface-cli
First, make sure you have huggingface-cli installed:
```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "<desired model file name>" --local-dir ./
```
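Glob patterns also work with `--include`; the pattern below is illustrative, so match it against the actual filenames listed in the repository:
```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "*Q4_K_S*.gguf" --local-dir ./
```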
Alternatively, you can download the files directly from the repository page.
## Prompt format
The prompt format is the same as Gemma v1; however, it is not embedded in the GGUF file. A `chat_template` key can be added to the GGUF metadata later with the gguf scripts (see the sketch after the template below).
```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
```
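If you want to embed the template yourself, the gguf-py scripts that ship with llama.cpp can write new metadata. A minimal sketch, assuming a llama.cpp checkout and a `chat_template.jinja` file you have filled with Gemma's chat template, e.g. copied from the original repo's `tokenizer_config.json` (the script is named `gguf-new-metadata.py` in older checkouts, and the model filenames here are illustrative):
```
pip install gguf
# Write a copy of the model with tokenizer.chat_template set
python llama.cpp/gguf-py/scripts/gguf_new_metadata.py \
  gemma-2-9b-it.Q4_K_S.gguf gemma-2-9b-it.Q4_K_S.patched.gguf \
  --chat-template "$(cat chat_template.jinja)"
```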
The model should stop at either `<eos>` or `<end_of_turn>`. If it does not, the stop tokens need to be added to the GGUF metadata.
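Until the metadata is patched, you can also pass the stop strings at request time. A sketch against llama.cpp's HTTP server (the binary may be named `server` instead of `llama-server` in older builds; the model filename is illustrative):
```
./llama-server -m gemma-2-9b-it.Q4_K_S.gguf -ngl 99 --port 8080

curl http://localhost:8080/completion -d '{
  "prompt": "<bos><start_of_turn>user\nWrite a haiku about autumn.<end_of_turn>\n<start_of_turn>model\n",
  "n_predict": 128,
  "stop": ["<end_of_turn>", "<eos>"]
}'
```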
## Quants
Currently only two quants are available:
| Quant  | Size   |
|--------|--------|
| Q4_K_S | 5.5 GB |
| Q3_K_M | 4.8 GB |
If Q4_K_S causes an out-of-memory error when offloading all layers to the GPU, consider decreasing the batch size or using Q3_K_M instead.

Minimum VRAM needed: 8GB
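For example, with llama.cpp's CLI (the binary may be named `main` in older builds; the model filename and batch size are illustrative):
```
# Offload all layers to the GPU; lower -b (or -ngl) if Q4_K_S runs out of VRAM
./llama-cli -m gemma-2-9b-it.Q4_K_S.gguf -ngl 99 -b 256 -c 4096 \
  -e -p "<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\n"
```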