Model Card for imatrix-jpn-test

gemma-2-9b-it quantized to 4 bits using an imatrix built from text that contains a large amount of Japanese
日本語テキストを多く含むimatrixで4bit量子化されたgemma-2-9b-it

Model Details

Using an importance matrix (imatrix) when quantizing a model for llama.cpp is known to improve performance.
Imatrixes are often created only from English text. However, if you are using a model in a language other than English, wouldn't it be better to create an imatrix that includes text in other languages as well? This model was created to verify the effectiveness of a multilingual imatrix.

モデルをllama.cpp用に量子化する際にImportance Matrix(imatrix)を使うと性能が向上する事が知られています。
imatrixは英語テキストのみから作成されている事が多いです。しかし、英語以外の言語を使ってモデルを使用するケースでは他の言語のテキストも混ぜてimatrixを作成した方がよいのではないでしょうか？
本モデルは多言語版imatrixの有効性を確かめるために作成されたモデルです。

Model Description

Performance Evaluation

The work was done on a CPU with the AVX512_BF16 flag turned on. Please note that the results may be different if you retest on a GPU.
The experiments took considerable time: 18 runs in total (9 model files x 2 test files) at 3 hours per run, or roughly 54 hours.

On Japanese data, the imatrix-jpn-test model achieved lower perplexity than both the no imatrix model and the Bartowski model. On English data, its perplexity was slightly higher than the Bartowski model's.
*The lower the perplexity, the better.

実験にはかなりの時間がかかり、合計 18 回実行されました (ファイルあたり 3 時間 x 18 回実行)。
作業はAVX512_BF16フラグをONにしたCPUで実施されています。GPUで追試をすると異なった結果になる可能性がある事に注意してください

imatrix-jpn-testモデルは、日本語データで測定したパープレキシティではno imatrixモデルおよびbartowskiモデルよりも優れたパフォーマンスを示しましたが、英語データで測定したパープレキシティではbartowskiモデルよりも若干高いパープレキシティを示しました。
※パープレキシティは低い方が良い指標です
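
As a reminder of what the metric means, here is a minimal sketch of the standard perplexity definition: the exponential of the mean negative log-probability per token. The function name and sample values are illustrative only; this is not the llama-perplexity implementation.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that gives every token probability 0.5 is "choosing between 2 options"
# on average, so its perplexity is approximately 2.
print(perplexity([math.log(0.5)] * 4))
```

A model that assigns higher probability to the actual next tokens gets a lower perplexity, which is why lower values are better.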

Results Summary

Measurements using English wiki.test.raw suggest that imatrix improves perplexity scores.
英語のwiki.test.rawを使った計測ではimatrixがperplexityスコアを向上させる事が示唆されました

Measurements using the Japanese ja-wiki.test.raw data suggest that the L and fp16 quantization variations improve scores.
日本語のja-wiki.test.rawデータを使った計測ではquantizations variation Lとquantizations variation fp16がスコアを向上させる事が示唆されました

Model	wiki.test.raw Perplexity	ja-wiki.test.raw Perplexity
bartowski M	8.8140	17.2091
bartowski L	8.8137	17.1035
bartowski fp16	8.8146	17.0946
imatrix-jpn-test M	8.8231	17.2069
imatrix-jpn-test L	8.8193	17.0931
imatrix-jpn-test fp16	8.8198	17.0877
no imatrix M	8.8885	17.3948
no imatrix L	8.8938	17.2974
no imatrix fp16	8.8887	17.2740
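
To make the gaps in the table easier to read, the following sketch computes the perplexity difference between two models for each quantization variation. The numbers are copied from the table; the `RESULTS` dictionary and `delta` helper are illustrative names, not part of any tool.

```python
# (wiki.test.raw perplexity, ja-wiki.test.raw perplexity), copied from the table
RESULTS = {
    ("bartowski", "M"): (8.8140, 17.2091),
    ("bartowski", "L"): (8.8137, 17.1035),
    ("bartowski", "fp16"): (8.8146, 17.0946),
    ("imatrix-jpn-test", "M"): (8.8231, 17.2069),
    ("imatrix-jpn-test", "L"): (8.8193, 17.0931),
    ("imatrix-jpn-test", "fp16"): (8.8198, 17.0877),
    ("no imatrix", "M"): (8.8885, 17.3948),
    ("no imatrix", "L"): (8.8938, 17.2974),
    ("no imatrix", "fp16"): (8.8887, 17.2740),
}

def delta(model_a, model_b, variation):
    """Return (English, Japanese) perplexity of model_a minus model_b.

    A negative value means model_a scored better (lower) than model_b.
    """
    en_a, ja_a = RESULTS[(model_a, variation)]
    en_b, ja_b = RESULTS[(model_b, variation)]
    return round(en_a - en_b, 4), round(ja_a - ja_b, 4)

for variation in ("M", "L", "fp16"):
    en, ja = delta("imatrix-jpn-test", "bartowski", variation)
    print(f"{variation}: English {en:+.4f}, Japanese {ja:+.4f}")
```

Against the Bartowski model, the English differences are +0.0091, +0.0056, and +0.0052, and the Japanese differences are -0.0022, -0.0104, and -0.0069 for M, L, and fp16 respectively. The sign is consistent across variations, but the magnitudes are small.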

用語集 Terminology

Importance Matrix (imatrix)
An "imatrix" is a data structure used to optimize the quantization of a model. You create one by passing text to the llama-imatrix command.
「imatrix」は、モデルの量子化を最適化するために使用されるデータ構造です。llama-imatrixコマンドにテキストを与えて作成します
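
As a rough illustration, the sketch below assembles a llama-imatrix command line without running it. The file names are taken from elsewhere in this card, but which source model file was actually passed to llama-imatrix is an assumption, and `build_imatrix_cmd` is a hypothetical helper.

```python
import shlex

def build_imatrix_cmd(model_path, calibration_path, output_path="imatrix.dat"):
    """Return the argument list for a llama-imatrix run (nothing is executed)."""
    return [
        "llama-imatrix",
        "-m", model_path,        # source model (assumed here to be the BF16 GGUF)
        "-f", calibration_path,  # calibration text fed to the model
        "-o", output_path,       # imatrix file to write
    ]

cmd = build_imatrix_cmd(
    "gemma-2-9B-it-BF16.gguf",
    "calibration_datav3_plus_jpn_v1.txt",
)
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine with llama.cpp installed.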

wiki.test.raw score
Perplexity score measured on wiki.test.raw, published by Salesforce, using the llama-perplexity command with the -c 512 setting. Lower values are better.
Salesforceが公開してくれているwiki.test.rawとllama-perplexityコマンドの -c 512設定で計測した数値。値が低いほど優れています。
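
The measurement described above might be scripted along these lines. This is only a sketch that builds the llama-perplexity command lines for both test files; the model file name is borrowed from the quantization examples in this card, and `build_perplexity_cmd` is a hypothetical helper.

```python
import shlex

TEST_FILES = ("wiki.test.raw", "ja-wiki.test.raw")

def build_perplexity_cmd(model_path, text_path, context_size=512):
    """Return the argument list for a llama-perplexity run (nothing is executed)."""
    return [
        "llama-perplexity",
        "-m", model_path,
        "-f", text_path,
        "-c", str(context_size),  # the -c 512 setting used for every score in this card
    ]

for text_path in TEST_FILES:
    print(shlex.join(build_perplexity_cmd("gemma-2-9b-it-Q4_K_M.gguf", text_path)))
```

Repeating this for each of the nine model files gives the 18 runs reported in the table.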

ja-wiki.test.raw perplexity score
Perplexity score measured in the same way using Japanese data (ja-wiki.test.raw) with the same file size as wiki.test.raw, extracted from a Japanese wiki. Lower values are better.
日本語のwikiから抜き出した文章でwiki.test.rawと同じファイルサイズにした日本語データ(ja-wiki.test.raw)で測定したperplexity score. 値が低いほど優れています。
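
This card does not say how the Japanese text was trimmed to match the size of wiki.test.raw. One possible approach, shown here purely as an assumption, is to cut the UTF-8 bytes at the target size and drop any character that was split at the boundary. `truncate_to_bytes` is a hypothetical helper.

```python
def truncate_to_bytes(text, max_bytes):
    """Trim text so its UTF-8 encoding is at most max_bytes, never splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops an incomplete multi-byte sequence left at the cut point
    return clipped.decode("utf-8", errors="ignore")

sample = "日本語のテキスト"
trimmed = truncate_to_bytes(sample, 7)  # each of these characters is 3 bytes in UTF-8
print(trimmed, len(trimmed.encode("utf-8")))  # prints: 日本 6
```

In practice the target size could come from `os.path.getsize("wiki.test.raw")`. Because Japanese characters take more bytes than English letters, files of equal size do not contain equal numbers of characters or tokens.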

Bartowski model
Bartowski is an individual who has quantized many models and contributed to the community. He created an imatrix from the English-only data calibration_datav3.txt and used it to quantize his model.
Bartowski は、多くのモデルを量子化し、コミュニティに貢献している実績のある人物です。彼は、英語のみのデータ calibration_datav3.txt から imatrix を作成し、それを使ってモデルを量子化しています。

Imatrix-jpn-test model
This model. It was quantized using an imatrix created from calibration_datav3_plus_jpn_v1.txt, which is calibration_datav3.txt with nearly four times as much Japanese data added.
このモデル。calibration_datav3.txtに約4倍の日本語データを追加して作成されたcalibration_datav3_plus_jpn_v1.txtを使って作成されたimatrixを使用して量子化されました。

No imatrix model
This is a model quantized without using imatrix.
imatrixを使わずに量子化したモデルです。

Quantization variation M (5.76 GB)
This is the standard Q4_K_M model.
通常のQ4_K_Mモデルです
Example:
llama-quantize gemma-2-9B-it-BF16.gguf gemma-2-9b-it-Q4_K_M.gguf Q4_k_m

Quantization variation fp16 (6.84 GB)
A quantization method, devised by ZeroWw, that keeps the output and token-embedding tensors in fp16.
ZeroWwが考案したoutputとembed tensorsをfp16にする量子化手法です
Example:
llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 --imatrix imatrix.dat gemma-2-9B-it-BF16.gguf gemma-2-9b-it-Q4_K_M-fp16.gguf Q4_k_m

Quantization variation L (5.98 GB)
A method Bartowski often uses for his own models. It is the same as the fp16 variation, except that the output and token-embedding tensors are stored as q8_0 instead of fp16.
bartowskiが自モデルに良く使用している手法で、fp16をq8_0にした量子化手法です
Example:
llama-quantize --allow-requantize --output-tensor-type q8_0 --token-embedding-type q8_0 --imatrix imatrix.dat gemma-2-9B-it-BF16.gguf gemma-2-9b-it-Q4_K_L.gguf Q4_k_m
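
The three variations differ only in the type used for the output and token-embedding tensors. The sketch below captures that in one place and rebuilds the example commands; it only assembles argument lists and does not run llama-quantize. `VARIATION_TENSOR_TYPE` and `build_quantize_cmd` are illustrative names.

```python
import shlex

# Type for the output and token-embedding tensors in each variation.
# None means "leave both at the Q4_K_M defaults".
VARIATION_TENSOR_TYPE = {"M": None, "L": "q8_0", "fp16": "f16"}

def build_quantize_cmd(variation, src, dst, imatrix=None):
    """Return the llama-quantize argument list for one variation (nothing is executed)."""
    tensor_type = VARIATION_TENSOR_TYPE[variation]
    cmd = ["llama-quantize"]
    if tensor_type is not None:
        cmd += [
            "--allow-requantize",
            "--output-tensor-type", tensor_type,
            "--token-embedding-type", tensor_type,
        ]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]
    # Quantization type spelled as in the example commands of this card
    cmd += [src, dst, "Q4_k_m"]
    return cmd

print(shlex.join(build_quantize_cmd(
    "L", "gemma-2-9B-it-BF16.gguf", "gemma-2-9b-it-Q4_K_L.gguf", imatrix="imatrix.dat")))
```

Passing `imatrix=None` reproduces the no imatrix case, so the same helper covers all nine model files compared in this card.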

注意事項 Notes

These results may vary depending on the model, so do not assume that they apply to every model. Gemma in particular is known to benefit from the L and fp16 quantization variations.
Even under almost identical conditions, scores may increase or decrease slightly. It is better to focus on trends rather than small differences.
Please note that the imatrix-jpn-test model uses five times as much text for its imatrix as the Bartowski model, so part of the improvement may come simply from the larger amount of calibration text.
In reality, it is better to measure performance with real tasks rather than perplexity. However, there are many different benchmarks for real tasks, so I will leave it up to you to verify this.
モデルによってこの結果は異なってくる可能性があります。あらゆるモデルに通用する結果とはまだ思わない方がよいです。gemmaは特にLおよびfp16のquantizations variationで性能が向上する事は知られています
ほぼ同等の条件でも微妙にスコアが増減する事があります。わずかな差に注目するのではなく傾向に注目する事が望ましいです
imatrix-jpn-testモデルはbartowskiモデルに比べてimatrixに5倍のテキストを使用している事に留意してください。単純にテキストが多いため性能が微妙に増えている可能性があります
本来はperplexityではなく実タスクで性能を測定する事が望ましいです。しかし、実タスクのベンチマークも多様なのでその検証は皆さんにお任せします

結論 Conclusion

The imatrix was effective for the 4-bit quantization tested here.
If you want to improve performance in languages other than English, it may be worth adding text in those languages to the imatrix, although this may reduce the model's English ability.
If you only use English, the choice of quantization variation may not make much difference at 4 bits.
今回試した4bit量子化においてimatrixは有効です
英語以外の言語の性能を少しでも向上させたい場合はimatrixに他言語を追加する価値はありそうです。しかし、モデルの英語能力が下がる可能性があります。
英語だけを使っている場合、量子化のバリエーションは4bitでは大きな違いがない可能性があります

その他参考情報 Other references

The following information may be helpful in your further exploration.
以下の情報は更なる探求を行う際に参考になるかもしれません。

謝辞 Acknowledgements

Thanks to the llama.cpp community.
llama.cppのコミュニティの皆さんに感謝します。
Thanks to Google for Gemma-2.
google gemma-2に感謝します
Thanks to ZeroWw, Bartowski, and u/noneabove1182 for the advice and motivation.
アドバイスとモチベーションをくれたZeroWw, Bartowski, u/noneabove1182に感謝します

I do not know all the inventors of each method, so please point out any that I have missed.
各手法の考案者については私はすべてを把握できているわけではないので漏れていたら指摘してください

Developed by: dahara1@webbigdata
Language(s) (NLP): English, Japanese
Base model: gemma-2-9b-it

BibTeX:

@misc{dahara2024imatrix,
  author       = {dahara1@webbigdata},
  title        = {IMatrix JPN Test: A Multilingual Model for Improved Performance},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/dahara1/imatrix-jpn-test}},
  note         = {Accessed: 2024-09-23},
  abstract     = {This model demonstrates the effectiveness of using a multilingual imatrix for model quantization, especially for improving performance in Japanese and other non-English languages.},
}