NeuML/Llama-3.1_OpenScholar-8B-AWQ

---
base_model: OpenScholar/Llama-3.1_OpenScholar-8B
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- llama-3.1
- autoawq
---

# Llama-3.1_OpenScholar-8B with AWQ Quantization

This is [Llama-3.1_OpenScholar-8B](https://huggingface.co/OpenScholar/Llama-3.1_OpenScholar-8B) quantized to 4-bit weights with AWQ, using [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) and the following code.

_Based on this [example code](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py)._

```python
import torch

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Input and output path
path = "OpenScholar/Llama-3.1_OpenScholar-8B"
output = "Llama-3.1_OpenScholar-8B-AWQ"

# Quantization config
config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path=path,
    low_cpu_mem_usage=True,
    use_cache=False,
    safetensors=False,
    device_map="cuda",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path)

# Quantize
model.quantize(tokenizer, quant_config=config)

# Save quantized model
model.save_quantized(output)

# Save tokenizer
# Note: Transformers >= 4.45.0 doubles size of tokenizer.json
# See https://github.com/huggingface/transformers/issues/34744
tokenizer.save_pretrained(output)

print(f'Model is quantized and saved to "{output}"')
```
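
The `config` values above (`w_bit`, `q_group_size`, `zero_point`) describe group-wise, zero-point integer quantization of the weights. The sketch below illustrates roughly what those settings mean on a single weight row. It is a simplified, pure-Python illustration and not AutoAWQ's implementation: it leaves out AWQ's activation-aware scaling and the packed GEMM storage layout. The function names are hypothetical, and the storage estimate assumes one 16-bit scale and one 4-bit zero point per group.

```python
import math


def quantize_group(weights, w_bit=4):
    """Zero-point (asymmetric) quantization of one group of weights."""
    qmax = 2 ** w_bit - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    # Step size between adjacent integer levels (guard against constant groups)
    scale = max(w_max - w_min, 1e-5) / qmax
    # Integer level that represents 0.0, kept within the valid range
    zero = max(0, min(qmax, round(-w_min / scale)))
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map integer levels back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Quantize a row in independent groups of q_group_size weights."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


def effective_bits(w_bit=4, q_group_size=128, scale_bits=16):
    """Approximate bits per weight, including per-group scale and zero point.

    Assumes one scale_bits-wide scale and one w_bit-wide zero point per group.
    """
    return w_bit + (scale_bits + w_bit) / q_group_size


# Toy weight row (synthetic values, not taken from the model)
row = [0.05 * math.sin(0.37 * i) for i in range(256)]
groups = quantize_row(row, q_group_size=128, w_bit=4)

for index, (q, scale, zero) in enumerate(groups):
    original = row[index * 128:(index + 1) * 128]
    restored = dequantize_group(q, scale, zero)
    max_error = max(abs(a - b) for a, b in zip(original, restored))
    print(f"group {index}: scale={scale:.6f} zero={zero} max_error={max_error:.6f}")

print(f"approx. bits per weight: {effective_bits()}")
```

Each group keeps its own scale and zero point, so a smaller `q_group_size` tracks the local weight range more closely at the cost of more per-group metadata.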