How to run the Int4 quantized model?

#10
by CharlesLincoln - opened

Same as the title


In the GitHub code, in basic_demo/trans_cli_vision_demo.py, comment out the default loading block and uncomment the INT4 block so that it reads:

#model = AutoModel.from_pretrained(
#    MODEL_PATH,
#    trust_remote_code=True,
#    # attn_implementation="flash_attention_2",  # Use Flash Attention
#    torch_dtype=torch.bfloat16,
#    device_map="auto",
#).eval()


## For INT4 inference
## (requires the bitsandbytes package, and BitsAndBytesConfig
##  imported from transformers at the top of the script)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # quantize weights to 4 bits while loading
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
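
For convenience, here is a minimal self-contained sketch of the same INT4 loading outside the demo script. MODEL_PATH is a placeholder you set yourself; the snippet assumes transformers, torch, and bitsandbytes are installed and that the checkpoint loads with AutoModel and AutoTokenizer under trust_remote_code, as in the demo.

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "your-model-id-or-local-path"  # placeholder: set this yourself

# The tokenizer is loaded normally; quantization only affects the model weights
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# load_in_4bit=True tells transformers to quantize the weights to 4 bits
# on the fly via bitsandbytes while loading
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()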
