Model Card for imatrix-jpn-test

gemma-2-9b-it, quantized with an imatrix built from calibration data that includes a large amount of Japanese text.

Model Details

Using an imatrix (importance matrix) when quantizing a model for llama.cpp is known to improve performance.
However, the imatrix is often created from English text only. When a model is used in languages other than English, would it be better to create the imatrix from text that also includes those languages?
This page tests the effectiveness of a multilingual imatrix.

Model Description

Measurements on the English wiki.test.raw show that using an imatrix improves the score.

Measurements on the Japanese ja-wiki.test.raw show that the L and fp16 quantization variations improve the score.

Terminology

wiki.test.raw score
Perplexity score measured on wiki.test.raw, published by Salesforce, using the llama-perplexity command with the -c 512 setting.
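
For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so a lower score is better. The sketch below only illustrates that definition. The function name and the sample values are assumptions made for illustration; llama-perplexity computes the score internally while evaluating the text in chunks of the -c context size.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 0.25 to each of four tokens.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))  # approximately 4.0, i.e. "as uncertain as a 4-way guess"
```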

ja-wiki.test.raw perplexity score
Perplexity score measured in the same way on Japanese data (ja-wiki.test.raw): text extracted from a Japanese wiki and cut to the same file size as wiki.test.raw.
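
How the Japanese file was cut to size is not described here. One possible approach, shown as a minimal sketch, is to truncate the UTF-8 bytes to the size of wiki.test.raw and drop any multi-byte character that the cut splits. The function name and the input file name ja-wiki-extract.txt are hypothetical.

```python
from pathlib import Path


def trim_to_size(text: str, target_bytes: int) -> str:
    """Cut text so its UTF-8 encoding is at most target_bytes long."""
    raw = text.encode("utf-8")[:target_bytes]
    # errors="ignore" drops a trailing partial multi-byte character, if the cut split one.
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    reference = Path("wiki.test.raw")
    source = Path("ja-wiki-extract.txt")  # hypothetical name for the extracted Japanese text
    if reference.exists() and source.exists():
        trimmed = trim_to_size(source.read_text(encoding="utf-8"), reference.stat().st_size)
        Path("ja-wiki.test.raw").write_text(trimmed, encoding="utf-8")
```

Because most Japanese characters take three bytes in UTF-8, a file of equal byte size holds far fewer characters than its English counterpart, which is one more reason to compare trends rather than absolute scores across the two files.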

Bartowski model
Bartowski is an individual who has quantized many models and contributed them to the community. He creates an imatrix from the English-only data calibration_datav3.txt and uses it to quantize his models.

Imatrix-jpn-test model
This model. It was quantized with an imatrix created from calibration_datav3_plus_jpn_v1.txt, which is calibration_datav3.txt with nearly four times as much Japanese data added.
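
The steps described above could look roughly like the following sketch. It assumes the Japanese text is simply appended to the English calibration file and that the imatrix is then generated with llama.cpp's llama-imatrix tool using its -m, -f and -o options. The helper names and the Japanese file name are hypothetical, and the exact procedure used for this model is not stated on this page.

```python
def combine_calibration(english: str, japanese: str) -> str:
    """Append Japanese text after the English calibration text, separated by one newline."""
    return english.rstrip("\n") + "\n" + japanese


def imatrix_command(model: str, calibration: str, output: str = "imatrix.dat") -> list[str]:
    """Build (but do not run) a llama-imatrix invocation as an argument list."""
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", output]


# Print the command for the file names used on this page; pass it to subprocess.run to execute.
print(" ".join(imatrix_command("gemma-2-9B-it-BF16.gguf", "calibration_datav3_plus_jpn_v1.txt")))
```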

No imatrix model
A model quantized without using an imatrix.

Quantization variation M (5.76 GB)
This is the standard Q4_K_M model.
Example:
llama-quantize gemma-2-9B-it-BF16.gguf gemma-2-9b-it-Q4_K_M.gguf Q4_k_m

Quantization variation fp16 (6.84 GB)
A quantization method invented by ZeroWw that keeps the output and token embedding tensors in fp16.
Example:
llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 --imatrix imatrix.dat gemma-2-9B-it-BF16.gguf gemma-2-9b-it-Q4_K_M-fp16.gguf Q4_k_m

Quantization variation L (5.98 GB)
A method Bartowski often uses for his own models: the same as the fp16 variation, but with the output and token embedding tensors set to q8_0 instead of fp16.
Example:
llama-quantize --allow-requantize --output-tensor-type q8_0 --token-embedding-type q8_0 --imatrix imatrix.dat gemma-2-9B-it-BF16.gguf gemma-2-9b-it-Q4_K_L.gguf Q4_k_m
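
The three variations differ only in the type given to the output and token embedding tensors, which a small helper can make explicit. This is a sketch that assembles the same llama-quantize arguments shown in the examples above without running them. The function name is hypothetical, and adding --imatrix to variation M for the imatrix models is an assumption, since the M example above omits it.

```python
def quantize_command(src, dst, variation, imatrix=None):
    """Build a llama-quantize argument list for variation "M", "L" or "fp16"."""
    # Tensor type for the output and token embedding tensors; None keeps the Q4_K_M default.
    tensor_type = {"M": None, "L": "q8_0", "fp16": "f16"}[variation]
    cmd = ["llama-quantize"]
    if tensor_type is not None:
        cmd += ["--allow-requantize",
                "--output-tensor-type", tensor_type,
                "--token-embedding-type", tensor_type]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]  # omit for the "No imatrix" models
    cmd += [src, dst, "Q4_K_M"]
    return cmd


for name in ("M", "L", "fp16"):
    print(" ".join(quantize_command("gemma-2-9B-it-BF16.gguf", f"out-{name}.gguf", name, "imatrix.dat")))
```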

Notes

These results may vary depending on the model, so it is best not to assume that they apply to all models. In particular, gemma is said to benefit from L/fp16 quants.
Even under almost identical conditions, scores may rise or fall slightly. Focus on trends rather than small differences.
Note that the imatrix-jpn-test model uses five times as much text for its imatrix as the Bartowski model. Its performance may be slightly higher simply because there is more text.
Ideally, performance should be measured on real tasks rather than by perplexity. However, benchmarks for real tasks are diverse, so that verification is left to the reader.

Considerations

In the 4-bit quantizations tested here, the imatrix appears to be effective in all cases.
If you want to improve performance in languages other than English even slightly, adding text in those languages to the imatrix seems worthwhile. However, English performance may decrease.
If you use only English, the quantization variations may not make much difference.

Other references

The following information may be helpful in your further exploration.

Acknowledgements

Thanks to the llama.cpp community.
Thanks to Google for Gemma-2.
Thanks to u/noneabove1182 for the advice and motivation.

I do not know the inventors of every method, so please let me know if I have missed any.

Developed by: dahara1@webbigdata
Language(s) (NLP): English, Japanese
Base model: gemma-2-9b-it

BibTeX:

@misc{dahara2024imatrix,
  author       = {dahara1@webbigdata},
  title        = {IMatrix JPN Test: A Multilingual Model for Improved Performance},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dahara1/imatrix-jpn-test}},
  note         = {Accessed: 2024-09-23},
  abstract     = {This model demonstrates the effectiveness of using a multilingual imatrix for model quantization, especially for improving performance in Japanese and other non-English languages.},
}