---
license: gemma
language:
- en
tags:
- conversational
quantized_by: qnixsynapse
---

## Llama.cpp quantizations of the official gemma-2-9b-it GGUF from the Kaggle repo
Using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> PR <a href="https://github.com/ggerganov/llama.cpp/pull/8156">8156</a> for quantization.

Original model: https://huggingface.co/google/gemma-2-9b-it


## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "<desired model file name>" --local-dir ./
```
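For example, assuming the Q4_K_S quant is stored as `gemma-2-9b-it-Q4_K_S.gguf` (the exact filename is an assumption; check the repository's file list for the actual name):

```
# Hypothetical filename; replace it with the file actually listed in the repo
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "gemma-2-9b-it-Q4_K_S.gguf" --local-dir ./
```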

Alternatively, you can download the files directly from the repository page.


## Prompt format

The prompt format is the same as Gemma v1; however, it is not included in the GGUF file. The GGUF metadata can be edited later with the gguf scripts to add a `chat_template` key.

```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model

```

The model should stop at either `<eos>` or `<end_of_turn>`. If it doesn't, the stop tokens need to be added to the GGUF metadata.
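As a minimal sketch of applying this format with llama.cpp's `llama-cli` (older builds name the binary `main`; the model filename is an assumption):

```
# <bos> is omitted from -p because llama-cli normally prepends it automatically
# -e processes escapes so "\n" becomes a real newline; -n limits generated tokens
./llama-cli -m ./gemma-2-9b-it-Q4_K_S.gguf -e -n 256 \
  -p "<start_of_turn>user\nWrite a haiku about autumn.<end_of_turn>\n<start_of_turn>model\n"
```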

## Quants
Currently only two quants are available:
| Quant  | Size   |
|--------|--------|
| Q4_K_S | 5.5 GB |
| Q3_K_M | 4.8 GB |

If Q4_K_S causes an OOM when offloading all layers to the GPU, consider decreasing the batch size or using Q3_K_M instead.

Minimum VRAM needed: 8 GB
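
A minimal sketch of dialing usage down when you hit an OOM, using standard llama.cpp flags (the model filename is an assumption):

```
# -ngl 99 offloads all layers to the GPU; -b lowers the batch size to reduce VRAM use
# Lower -b first; if that is not enough, reduce -ngl or switch to the Q3_K_M file
./llama-cli -m ./gemma-2-9b-it-Q4_K_S.gguf -ngl 99 -b 256 -p "Hello" -n 64
```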