How to run the Int4 quantized model?
#10
by CharlesLincoln - opened
Same as the title
In the GitHub code, in basic_demo/trans_cli_vision_demo.py, comment out the default bfloat16 loading block and enable the INT4 block instead:
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Default bfloat16 loading (commented out when running INT4 inference)
# model = AutoModel.from_pretrained(
#     MODEL_PATH,
#     trust_remote_code=True,
#     # attn_implementation="flash_attention_2",  # Use Flash Attention
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# ).eval()

# INT4 inference via bitsandbytes 4-bit quantization
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
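Once loaded this way, inference works the same as with the full-precision model. Below is a rough, untested text-only sketch, assuming MODEL_PATH also provides the tokenizer and that the model's custom code supports the standard generate() interface as used in the repo's demo scripts; the vision demo additionally passes an image to the chat template.

# Minimal usage sketch (assumptions: tokenizer lives at MODEL_PATH, generate() is supported)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Build a chat-formatted prompt and move it to the model's device
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, who are you?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))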