---
license: cc
datasets:
- VMware/open-instruct-v1-oasst-dolly-hhrlhf
language:
- en
pipeline_tag: text-generation
---
# SearchUnify-ML/xgen-7b-8k-open-instruct-gptq
These are GPTQ 4-bit model files for [VMware's XGen 7B 8K Open Instruct](https://huggingface.co/VMware/xgen-7b-8k-open-instruct).
They are the result of quantising the model to 4-bit using GPTQ-for-LLaMa.
The model is open for **commercial use**.
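
If you prefer to fetch the quantized weights ahead of time, a minimal sketch using `huggingface_hub` (an optional dependency, not required by the code below) looks like this:

```python
# Optional: pre-download the GPTQ weights locally (assumes huggingface_hub is installed).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="SearchUnify-ML/xgen-7b-8k-open-instruct-gptq")
print(local_dir)  # local path containing gptq_model-4bit-128g.safetensors and the tokenizer files
```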
# How to use this GPTQ model from Python code
First, make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

```bash
pip install auto-gptq
```
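
If the prebuilt wheel does not match your environment (for example a different CUDA version), installing AutoGPTQ from source is an alternative; this assumes a working CUDA toolkit and compiler on your machine:

```bash
# Source install of AutoGPTQ (assumption: your environment can build the CUDA extension)
pip install git+https://github.com/PanQiWei/AutoGPTQ.git
```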
```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the quantized model; quantize_config=None means it is read from the repo's config file.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=True,
                                           trust_remote_code=False,
                                           device="cuda:0",
                                           use_triton=use_triton,
                                           quantize_config=None)

# Note: check that the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''### Instruction: {prompt}

### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline.

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ.
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
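
For repeated queries, the prompt construction can be wrapped in a small helper that reuses the same template; `build_prompt` below is an illustrative name, not part of this model card or the AutoGPTQ API:

```python
def build_prompt(instruction: str) -> str:
    # Same "### Instruction / ### Response" template used above.
    return f"### Instruction: {instruction}\n\n### Response:"

# return_full_text=False makes the pipeline return only the newly generated text.
result = pipe(build_prompt("Summarise what GPTQ 4-bit quantisation does."),
              return_full_text=False)
print(result[0]['generated_text'])
```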