
For NVIDIA cards, install the CUDA toolkit:

NVIDIA Maxwell or higher
https://developer.nvidia.com/cuda-downloads

NVIDIA Kepler or higher
https://developer.nvidia.com/cuda-11-8-0-download-archive

Restart your computer after installing the CUDA toolkit to make sure the PATH is set correctly.

I haven't done much testing, but on Windows, Visual Studio 2019 with the "Desktop development with C++" workload might be required.
https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=community&rel=16&utm_medium=microsoft&utm_campaign=download+from+relnotes&utm_content=vs2019ga+button
In the installer, select the "Desktop development with C++" workload.

Make sure to have git and wget installed on your system.
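
As a quick way to confirm that both tools are on your PATH before running setup, here is a minimal sketch. The function name missing_tools is made up for this example and is not part of the scripts in this repo.

```python
import shutil


def missing_tools(tools=("git", "wget")):
    """Return the names of the tools that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("git and wget were both found.")
```

If a tool is reported missing right after you installed it, opening a new terminal (or restarting) usually refreshes the PATH.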

For Linux, install the build tools from your package manager.
For example, on Ubuntu use: sudo apt-get install build-essential

This may work with AMD cards, but only on Linux and possibly WSL2. I can't guarantee it, as I don't have an AMD card to test with. You may need to install ROCm before starting: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

Only Python 3.8 to 3.11 is known to work. If you have a newer version of Python, I can't guarantee that it will work.
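
To check whether the interpreter you are about to use falls in that range, a small sketch like the following may help. The names MIN_VERSION, MAX_VERSION and is_supported are made up for this example; the setup scripts may or may not perform a similar check.

```python
import sys

# Range stated in this README as known to work (inclusive).
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 11)


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is within the known-good range."""
    major_minor = (version_info[0], version_info[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    if is_supported():
        print(f"Python {current} is in the known-good range.")
    else:
        print(f"Python {current} is outside 3.8 to 3.11 and may not work.")
```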

First, set up your environment by running either windows.bat or linux.sh. If something fails during setup, delete the venv folder and try again.

After setup is complete, you'll have a file called start-quant. Use it to run the quant script.

Make sure that your free storage space is 3x the model's size. To estimate this, take the number of parameters in billions and multiply by two to get the approximate model size in GB, then multiply by 3 to get the recommended storage. For example, a 7B model is about 14 GB, so 42 GB is recommended. You may be able to get away with 2.5x the size.
Also make sure you have plenty of RAM; how much depends on the model. I have noticed that Gemma uses a lot.
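
The storage rule of thumb above can be written as a short calculation. This is only a sketch: the function name recommended_storage_gb is made up for this example, and it assumes the "multiply by two" step corresponds to roughly 2 bytes per parameter (16-bit weights), so the result is in GB.

```python
def recommended_storage_gb(params_billion, multiplier=3.0):
    """Estimate the free disk space (in GB) to have before quantizing."""
    # Assumption: about 2 bytes per parameter, so N billion parameters
    # take roughly 2 * N GB on disk.
    model_size_gb = params_billion * 2
    return model_size_gb * multiplier


if __name__ == "__main__":
    for size in (7, 13, 70):
        recommended = recommended_storage_gb(size)
        minimum = recommended_storage_gb(size, multiplier=2.5)
        print(f"{size}B model: about {recommended:.0f} GB recommended, "
              f"{minimum:.0f} GB may be enough")
```

The 2.5 multiplier is the "may get away with" figure from above, so treat it as a lower bound rather than a target.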

If you close the terminal or the terminal crashes, check the last BPW it was on and enter the remaining quants you wanted. It should be able to pick up where it left off. Don't enter the BPW of completed quants, as it will start them from the beginning. You may also use Ctrl + C to pause at any time during the quant process.
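
To work out what to re-enter after an interruption, something like the sketch below could be used. It is not part of the quant script: remaining_quants is a made-up helper, and it assumes quants run in the order you entered them, so everything before the interrupted BPW is already complete.

```python
def remaining_quants(requested, last_bpw):
    """Return the BPW values to re-enter, starting with the interrupted one."""
    if last_bpw not in requested:
        raise ValueError(f"{last_bpw} was not in the requested list")
    # Assumption: quants are processed in the order given, so entries
    # before the interrupted BPW are finished and should be left out.
    return requested[requested.index(last_bpw):]


if __name__ == "__main__":
    requested = [8.0, 6.0, 5.0, 4.0]
    # Suppose the terminal closed while the 5.0 BPW quant was running.
    print(remaining_quants(requested, 5.0))  # prints [5.0, 4.0]
```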

To add more options to the quantization process, you can add them to line 174. All options: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md

Things may break in the future because setup downloads the latest version of every dependency, and these may change names or behavior. If something breaks, please open a discussion at https://huggingface.co/Anthonyg5005/hf-scripts/discussions

Credit to turboderp for creating exllamav2 and the exl2 quantization method.
https://github.com/turboderp

Credit to oobabooga for the original download and safetensors scripts.
https://github.com/oobabooga

Credit to Lucain Pouget for maintaining huggingface-hub.
https://github.com/Wauplin

Only tested with CUDA 12.1 on Windows 11.