How to run the Int4 quantized model?
#10
by CharlesLincoln - opened
Same as the title
In the GitHub code, in basic_demo/trans_cli_vision_demo.py, comment out the default bfloat16 loading block and enable the INT4 block instead:
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Default bfloat16 loading (commented out when running INT4 inference)
# model = AutoModel.from_pretrained(
#     MODEL_PATH,
#     trust_remote_code=True,
#     # attn_implementation="flash_attention_2",  # Use Flash Attention
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# ).eval()

# INT4 inference via bitsandbytes 4-bit quantization
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
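Once loaded this way, inference works the same as with the full-precision model. Below is a rough, untested text-only sketch, assuming MODEL_PATH also provides the tokenizer and that the model's custom code supports the standard generate() interface as used in the repo's demo scripts; the vision demo additionally passes an image to the chat template.

# Minimal usage sketch (assumptions: tokenizer lives at MODEL_PATH, generate() is supported)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Build a chat-formatted prompt and move it to the model's device
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, who are you?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))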