SearchUnify-ML committed
Commit 30ffb0c
Parent: 52c3ddc

Added the code snippet to run the model with the GPTQ library

Files changed (1)
  1. README.md +20 -36
README.md CHANGED
@@ -13,68 +13,52 @@ These are GPTQ 4bit model files for [VMWare's XGEN 7B 8K Open Instruct](https://
 It is the result of quantising to 4bit using GPTQ-for-LLaMa.

-The model is open for COMMERCIAL USE.

 # How to use this GPTQ model from Python code

 First, make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

-#### pip install auto-gptq
+```
+pip install auto-gptq
+```

+Second, install tiktoken in order to use the tokenizer:

-<code>
-from transformers import AutoTokenizer, pipeline, logging
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-import argparse
+```
+pip install tiktoken
+```
+
+```
+
+from transformers import AutoTokenizer, pipeline
+from auto_gptq import AutoGPTQForCausalLM

 model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
 model_basename = "gptq_model-4bit-128g"

 use_triton = False

-tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)

 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                            model_basename=model_basename,
-                                           use_safetensors=True,
-                                           trust_remote_code=False,
+                                           use_safetensors=False,
+                                           trust_remote_code=True,
                                            device="cuda:0",
-                                           use_triton=use_triton,
-                                           quantize_config=None)
+                                           use_triton=use_triton)

 # Note: check the prompt template is correct for this model.
-prompt = "Tell me about AI"
+prompt = "Explain the rules of field hockey to a novice."
 prompt_template=f'''### Instruction: {prompt}
 ### Response:'''

 print("\n\n*** Generate:")

 input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
-output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
-print(tokenizer.decode(output[0]))
-
-# Inference can also be done using transformers' pipeline
-
-# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
-logging.set_verbosity(logging.CRITICAL)
-
-print("*** Pipeline:")
-pipe = pipeline(
-    "text-generation",
-    model=model,
-    tokenizer=tokenizer,
-    max_new_tokens=1024,
-    temperature=0.3,
-    top_p=0.95,
-    repetition_penalty=1.15
-)
-
-print(pipe(prompt_template)[0]['generated_text'])
-
-
-<code>
-
+output = model.generate(inputs=input_ids, temperature=0.3, max_new_tokens=512)
+print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")
+```
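
Note that the updated snippet imports `pipeline` from transformers but never calls it; the pipeline-based example from the previous revision was removed in this commit. For readers who still want that route, here is a minimal sketch that mirrors the removed lines. The generation settings are illustrative, and `do_sample=True` is an addition so that `temperature` and `top_p` actually take effect; it is not part of this commit.

```
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=False)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

# Sampling parameters mirror the snippet removed by this commit;
# do_sample=True is added here so temperature/top_p are not ignored.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```

Both routes load the quantised checkpoint the same way; only the call that drives generation differs.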