How to quantise the model?
Hi, I learned how to convert the original format to the HF format from your documentation and converted the 13B model successfully. But I can't load the HF model onto my GPU, since it takes over 39 GB of VRAM. I suspect quantizing the model might solve my problem, but I don't know how to do it.
Performing quantization also requires loading the model into VRAM; instead, you could try an already-quantized model.
For example, this GPTQ one.
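Loading one of those usually looks something like the sketch below. This is just a rough outline: it assumes you have transformers, accelerate, optimum, and auto-gptq installed, and the repo name is only an example of a community-quantized 13B checkpoint, not a recommendation of a specific one.

# Sketch: load a pre-quantized GPTQ checkpoint with transformers
# (assumes `pip install transformers accelerate optimum auto-gptq`;
# the model id below is just an example repo name)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"  # example GPTQ repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the already-4-bit weights on the available GPU(s)
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))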
Thank you. I realize I have to give up on doing it on my own PC for now. But I'd still like to know how to quantize a model; maybe I'll try it on a smaller model like Llama-2 7B later.
Is the code below the answer to my question?
import torch

# Apply dynamic quantization: Linear layers are converted to int8
quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8, inplace=False
)

# Save the quantized model's weights
torch.save(quantized_model.state_dict(), "quantized_model.pth")
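Or, if I want real 4-bit weights on the GPU, would something like the bitsandbytes loading below be closer to what I need? This is only my guess from the transformers documentation (it assumes `pip install transformers accelerate bitsandbytes` and access to the meta-llama/Llama-2-7b-hf repo; I haven't been able to test it yet):

# Hypothetical sketch: load Llama-2 7B directly in 4-bit with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # example repo name
    quantization_config=bnb_config,
    device_map="auto",
)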