---
base_model: OpenScholar/Llama-3.1_OpenScholar-8B
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- llama-3.1
- autoawq
---
# Llama-3.1_OpenScholar-8B with AWQ Quantization
This is [Llama-3.1_OpenScholar-8B](https://huggingface.co/OpenScholar/Llama-3.1_OpenScholar-8B) with 4-bit AWQ quantization applied using the code below.

_Based on this [example code](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py) from the AutoAWQ repository._
```python
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Input and output paths
path = "OpenScholar/Llama-3.1_OpenScholar-8B"
output = "Llama-3.1_OpenScholar-8B-AWQ"
# Quantization config
config = {
    "zero_point": True,   # asymmetric quantization with zero points
    "q_group_size": 128,  # quantize weights in groups of 128
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # use the GEMM kernel variant
}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path=path,
    low_cpu_mem_usage=True,
    use_cache=False,
    safetensors=False,
    device_map="cuda",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path)
# Quantize
model.quantize(tokenizer, quant_config=config)
# Save quantized model
model.save_quantized(output)
# Save tokenizer
# Note: Transformers >= 4.45.0 doubles the size of tokenizer.json
# See https://github.com/huggingface/transformers/issues/34744
tokenizer.save_pretrained(output)
print(f'Model is quantized and saved to "{output}"')
```
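The quantized model can then be loaded for inference with `transformers` (which supports AWQ checkpoints when `autoawq` is installed). The snippet below is a minimal sketch: the path matches the `output` directory from the script above, and the prompt is purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the quantized model (the `output` directory from the script above,
# or this repository's id on the Hugging Face Hub)
path = "Llama-3.1_OpenScholar-8B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="cuda")

# Illustrative prompt only
inputs = tokenizer("What is retrieval-augmented generation?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```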