---
license: cc
datasets:
  - VMware/open-instruct-v1-oasst-dolly-hhrlhf
language:
  - en
pipeline_tag: text-generation
---

# SearchUnify-ML/xgen-7b-8k-open-instruct-gptq

These are GPTQ 4-bit model files for VMware's XGen 7B 8K Open Instruct.

They are the result of quantising the original model to 4-bit using GPTQ-for-LLaMa.
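
For context, the sketch below shows an equivalent 4-bit, group-size-128 quantisation flow using AutoGPTQ's `quantize` API. It is only an illustration: the published files were produced with GPTQ-for-LLaMa, and the base-model repo id (`VMware/xgen-7b-8k-open-instruct`), the output directory, and the one-sentence calibration set are assumptions.

```python
# Hedged sketch of a 4-bit, group-size-128 quantisation with AutoGPTQ.
# The actual files in this repo were produced with GPTQ-for-LLaMa; the
# base-model id and calibration text below are illustrative assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "VMware/xgen-7b-8k-open-instruct"   # assumed base-model repo id
quantized_dir = "xgen-7b-8k-open-instruct-gptq"  # assumed output directory

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False, trust_remote_code=True)

# A real run would use a much larger calibration set; one sentence keeps the sketch short.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

# 4-bit weights with group size 128, matching the basename gptq_model-4bit-128g.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config, trust_remote_code=True)
model.quantize(examples)
model.save_quantized(quantized_dir)
```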

## How to use this GPTQ model from Python code

First, make sure you have AutoGPTQ installed:

```bash
pip install auto-gptq
```

Second, install tiktoken, which the model's tokenizer requires:

```bash
pip install tiktoken
```

Then the model can be loaded and run as follows:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

# The XGen tokenizer comes from the repo's custom code (hence trust_remote_code)
# and depends on tiktoken; there is no fast tokenizer, so use_fast=False.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)

# Load the 4-bit GPTQ checkpoint onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=False,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton)

# Note: check that the prompt template is correct for this model.
prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# do_sample=True is required for the temperature setting to have any effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.3, max_new_tokens=512)
print(f"\n\n{tokenizer.decode(output[0]).split('### Response:')[1]}")
```
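
For repeated queries, the prompt formatting and generation call can be wrapped in a small helper. This is a minimal sketch that reuses the `model` and `tokenizer` objects created above; the function name and its default arguments are illustrative, not part of the repository.

```python
def ask(instruction: str, max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    """Format an instruction with the model's prompt template and return only the response text."""
    prompt_template = f"### Instruction: {instruction}\n### Response:"
    input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()
    output = model.generate(
        inputs=input_ids,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )
    # Keep only the text after the "### Response:" marker.
    return tokenizer.decode(output[0]).split("### Response:")[1].strip()


print(ask("Summarise the benefits of 4-bit GPTQ quantisation in two sentences."))
```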