SearchUnify-ML committed
Commit 30ffb0c
Parent: 52c3ddc

Added the code snippet to run the model with the GPTQ library

Files changed (1)
  1. README.md +20 -36
README.md CHANGED
@@ -13,68 +13,52 @@ These are GPTQ 4bit model files for [VMWare's XGEN 7B 8K Open Instruct](https://
 It is the result of quantising to 4bit using GPTQ-for-LLaMa.

-The model is open for COMMERCIAL USE.

 # How to use this GPTQ model from Python code

 First, make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

-#### pip install auto-gptq
+```
+pip install auto-gptq
+```

+Second, install tiktoken in order to use the tokenizer:

-<code>
-from transformers import AutoTokenizer, pipeline, logging
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-import argparse
+```
+pip install tiktoken
+```
+
+```
+
+from transformers import AutoTokenizer, pipeline
+from auto_gptq import AutoGPTQForCausalLM

 model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
 model_basename = "gptq_model-4bit-128g"

 use_triton = False

-tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)

 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                            model_basename=model_basename,
-                                           use_safetensors=True,
-                                           trust_remote_code=False,
+                                           use_safetensors=False,
+                                           trust_remote_code=True,
                                            device="cuda:0",
-                                           use_triton=use_triton,
-                                           quantize_config=None)
+                                           use_triton=use_triton)

 # Note: check the prompt template is correct for this model.
-prompt = "Tell me about AI"
+prompt = "Explain the rules of field hockey to a novice."
 prompt_template=f'''### Instruction: {prompt}
 ### Response:'''

 print("\n\n*** Generate:")

 input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
-output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
-print(tokenizer.decode(output[0]))
-
-# Inference can also be done using transformers' pipeline
-
-# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
-logging.set_verbosity(logging.CRITICAL)
-
-print("*** Pipeline:")
-pipe = pipeline(
-    "text-generation",
-    model=model,
-    tokenizer=tokenizer,
-    max_new_tokens=1024,
-    temperature=0.3,
-    top_p=0.95,
-    repetition_penalty=1.15
-)
-
-print(pipe(prompt_template)[0]['generated_text'])
-
-
-<code>
-
+output = model.generate(inputs=input_ids, temperature=0.3, max_new_tokens=512)
+print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")
+```
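
Note that the updated snippet imports `pipeline` from transformers but never calls it; the pipeline-based example from the previous revision was removed in this commit. For readers who still want that route, here is a minimal sketch that mirrors the removed lines. The generation settings are illustrative, and `do_sample=True` is an addition so that `temperature` and `top_p` actually take effect; it is not part of this commit.

```
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=False)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

# Sampling parameters mirror the snippet removed by this commit;
# do_sample=True is added here so temperature/top_p are not ignored.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```

Both routes load the quantised checkpoint the same way; only the call that drives generation differs.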