base_model: google/gemma-2-9b-it
language:
- multilingual
datasets:
- TFMC/imatrix-dataset-for-japanese-llm
library_name: transformers
license: gemma
license_link: https://ai.google.dev/gemma/terms
pipeline_tag: text-generation
tags:
- nlp
- code
quantized_by: ymcki
widget:
- messages:
- role: user
content: Can you provide ways to eat combinations of bananas and dragonfruits?
Original model: https://huggingface.co/google/gemma-2-9b-it
Description
The purpose of this repository is to see whether a Japanese-specific imatrix can improve the performance of a model that is not optimized for Japanese.
It also provides the Q4_0_8_8, Q4_0_4_8, and Q4_0_4_4 GGUFs for edge devices, which bartowski did not otherwise make. These models should also be good for edge devices with 16GB RAM.
Prompt format
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model
Note that this model does not support a system prompt.
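For a quick test, this format can be passed straight to llama.cpp's llama-cli (a minimal sketch; the -e flag makes llama-cli expand the \n escapes, and the model path is whichever gguf you downloaded):
./llama-cli -m gemma-2-9b-it.Q4_0.gguf -e -p "<start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n" -n 256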
Download a file (not the whole branch) from below:
ELYZA-tasks-100 is a fairly standard benchmark for Japanese LLMs. A perfect score is 5.00. As a reference, bartowski's gemma-2-27b-it.Q6_K.gguf scores 4.04.
Filename | Quant type | File Size | Split | ELYZA-tasks-100 | Nvidia 3090 Speed | Description |
---|---|---|---|---|---|---|
gemma-2-9b-it.f16.gguf | f16 | 18.49GB | false | 3.75 | 31.9t/s | Full F16 weights. |
gemma-2-9b-it.Q8_0.gguf | Q8_0 | 9.83GB | false | 3.06 | 56.1t/s | Extremely high quality, recommended for edge devices with 16GB RAM. |
gemma-2-2b-jpn-it-imatrix.Q4_0.gguf | Q4_0 | 1.63GB | false | 2.89 | 137t/s | Good quality, recommended for edge devices with 8GB RAM. |
gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf | Q4_0_8_8 | 1.63GB | false | TBD | TBD | Good quality, recommended for edge devices with <8GB RAM. |
gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf | Q4_0_4_8 | 1.63GB | false | TBD | TBD | Good quality, recommended for edge devices with <8GB RAM. |
gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf | Q4_0_4_4 | 1.63GB | false | TBD | TBD | Good quality, recommended for edge devices with <8GB RAM. |
gemma-2-9b-it.Q4_0.gguf | Q4_0 | 5.44GB | false | 3.64 | 65.1t/s | Good quality, recommended for edge devices with 8GB RAM. |
gemma-2-2b-jpn-it.Q4_0_8_8.gguf | Q4_0_8_8 | 1.63GB | false | TBD | TBD | Good quality, recommended for edge devices with <8GB RAM. |
gemma-2-2b-jpn-it.Q4_0_4_8.gguf | Q4_0_4_8 | 1.63GB | false | TBD | TBD | Good quality, recommended for edge devices with <8GB RAM. |
gemma-2-2b-jpn-it.Q4_0_4_4.gguf | Q4_0_4_4 | 1.63GB | false | TBD | TBD | Good quality, recommended for edge devices with <8GB RAM. |
How to check i8mm and sve support for ARM devices
ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures >= ARMv8.6-A support i8mm.
ARM sve support is necessary to take advantage of the Q4_0_8_8 gguf. sve is an optional feature available from ARMv8.2-A onward, but the majority of ARM chips do not implement it.
For ARM devices with neither, it is recommended to use Q4_0_4_4.
With the appropriate support, inference speed should be faster in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on response quality.
Here is a list of ARM devices that support these instructions. Apparently, it is only a partial list, so it is better to check for i8mm and sve support yourself.
For Apple devices,
sysctl hw
For other ARM devices (i.e., most Android devices),
cat /proc/cpuinfo
There are also Android apps that can display /proc/cpuinfo.
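As a rough sketch, you can also query the flags directly (the sysctl key below exists on recent Apple Silicon macOS; treat it as an assumption elsewhere):
# macOS on Apple Silicon: 1 = supported, 0 or a missing key = not supported
sysctl hw.optional.arm.FEAT_I8MM
# Linux/Android: each supported extension appears in the Features line
grep -o -w -e i8mm -e sve /proc/cpuinfo | sort -u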
I was told that for Intel/AMD CPU inference, support for AVX2/AVX512 can also improve the performance of Q4_0_8_8.
On the other hand, Nvidia 3090 inference is significantly faster for Q4_0 than for the other ggufs. That means for GPU inference, you are better off using Q4_0.
Which Q4_0 model to use for ARM devices
Brand | Series | Model | i8mm | sve | Quant Type |
---|---|---|---|---|---|
Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
Apple | M | M1 | No | No | Q4_0_4_4 |
Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |
Google | Tensor | G1,G2 | No | No | Q4_0_4_4 |
Google | Tensor | G3,G4 | Yes | Yes | Q4_0_8_8 |
Samsung | Exynos | 2200,2400 | Yes | Yes | Q4_0_8_8 |
Mediatek | Dimensity | 9000 | Yes | Yes | Q4_0_8_8 |
Mediatek | Dimensity | 9300 | Yes | No | Q4_0_4_8 |
Qualcomm | Snapdragon | 8 Gen 1 | Yes | Yes | Q4_0_8_8 |
Qualcomm | Snapdragon | 8 Gen 2,8 Gen 3,X Elite | Yes | No | Q4_0_4_8 |
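Condensing the table into a one-liner, here is a minimal sketch for Linux/Android that picks a quant type from the kernel's feature flags (a heuristic only; verify against the table above):
# prefer sve, then i8mm, else fall back to the plain 4_4 variant
if grep -qw sve /proc/cpuinfo; then echo Q4_0_8_8; elif grep -qw i8mm /proc/cpuinfo; then echo Q4_0_4_8; else echo Q4_0_4_4; fi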
imatrix quantization
According to this blog, adding an imatrix to a low-bit quant can significantly improve performance. The best dataset for Japanese is TFMC/imatrix-dataset-for-japanese-llm. Therefore, I also created imatrix versions of the different Q4_0 quants.
However, based on my benchmarking results, the difference is not significant.
Convert safetensors to f16 gguf
Make sure you have llama.cpp git cloned:
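For example (a minimal sketch; requirements.txt pulls in the Python dependencies the conversion script needs):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
Then run: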
python3 convert_hf_to_gguf.py gemma-2-9b-it/ --outfile gemma-2-9b-it.f16.gguf --outtype f16
Convert f16 gguf to Q8_0 gguf without imatrix
Make sure you have llama.cpp compiled:
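For example, with the CMake flow (a sketch; older checkouts used plain make, and recent ones put the binaries in build/bin, so adjust the ./llama-quantize path accordingly):
cmake -B build
cmake --build build --config Release -j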
./llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q8_0.gguf q8_0
Convert f16 gguf to other ggufs with imatrix
First, prepare an imatrix from the f16 gguf and c4_en_ja_imatrix.txt:
./llama-imatrix -m gemma-2-9b-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-9b-it.imatrix --chunks 32
Then, quantize the f16 gguf with the imatrix to create the imatrix gguf:
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_8_8.gguf q4_0_8_8
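To produce all the imatrix Q4_0 variants in one go, a small loop works (a sketch; tr lowercases the quant type for llama-quantize while keeping the uppercase filename convention used above):
for q in Q4_0 Q4_0_4_4 Q4_0_4_8 Q4_0_8_8; do
  ./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.${q}.gguf $(echo $q | tr 'A-Z' 'a-z')
done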
Downloading using huggingface-cli
First, make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Then, you can target the specific file you want:
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "gemma-2-9b-it.Q8_0.gguf" --local-dir ./
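To grab all the Q4_0 variants at once, --include also accepts glob patterns (a sketch using the file-naming convention above):
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "*Q4_0*.gguf" --local-dir ./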
Credits
Thank you bartowski for providing a README.md to get me started.
Thank you YoutechA320U for the ELYZA-tasks-100 auto evaluation tool.