---
license: cc
datasets:
- VMware/open-instruct-v1-oasst-dolly-hhrlhf
language:
- en
pipeline_tag: text-generation
---

# SearchUnify-ML/xgen-7b-8k-open-instruct-gptq

These are GPTQ 4-bit model files for [VMware's XGen 7B 8K Open Instruct](https://huggingface.co/VMware/xgen-7b-8k-open-instruct). They are the result of quantising the model to 4 bits using GPTQ-for-LLaMa.

The model is open for COMMERCIAL USE.

# How to use this GPTQ model from Python code

First, make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

```
pip install auto-gptq
```

Then load the quantised model and run generation:

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=True,
                                           trust_remote_code=False,
                                           device="cuda:0",
                                           use_triton=use_triton,
                                           quantize_config=None)

# Note: check that the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline.

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ.
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
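
# Reproducing the quantisation (illustrative sketch)

The files in this repository were produced with GPTQ-for-LLaMa, as noted above. Purely as an illustration, a comparable 4-bit, group-size-128 quantisation can be sketched with AutoGPTQ's own API; the calibration text, `desc_act` setting, and output directory below are placeholder assumptions, not the exact settings used for this repository.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Placeholder paths: the base model to quantise and where to write the result.
pretrained_model_dir = "VMware/xgen-7b-8k-open-instruct"
quantized_model_dir = "xgen-7b-8k-open-instruct-gptq"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# Tiny placeholder calibration set; a real run would use a larger, representative sample.
examples = [
    tokenizer(
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
    )
]

# 4-bit weights with a group size of 128, matching the "4bit-128g" basename above;
# desc_act=False is an assumed setting.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

# Save as safetensors so it can be loaded with use_safetensors=True as shown above.
model.save_quantized(quantized_model_dir, use_safetensors=True)
```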