---
license: cc
datasets:
- VMware/open-instruct-v1-oasst-dolly-hhrlhf
language:
- en
pipeline_tag: text-generation
inference: false
---

# SearchUnify/xgen-7b-8k-open-instruct-gptq

With its industry-first robust LLM integrations across its suite of products ([Cognitive Search](https://www.searchunify.com/products/cognitive-search/?utm_source=link&utm_medium=ml-model&utm_campaign=hugging-face), [SUVA](https://www.searchunify.com/products/suva/), [Knowbler](https://www.searchunify.com/products/knowbler/?utm_source=link&utm_medium=ml-model&utm_campaign=hugging-face), [Escalation Predictor](https://applications.searchunify.com/escalation-predictor?utm_source=link&utm_medium=ml-model&utm_campaign=hugging-face), [Agent Helper](https://applications.searchunify.com/agent-helper?utm_source=link&utm_medium=ml-model&utm_campaign=hugging-face) and [Community Helper](https://applications.searchunify.com/community-helper?utm_source=link&utm_medium=ml-model&utm_campaign=hugging-face)), coupled with its federated retrieval-augmented generation (FRAG) architecture, [SearchUnify's unified cognitive platform](https://www.searchunify.com/?utm_source=link&utm_medium=ml-model&utm_campaign=hugging-face) fetches relevant information and responses to deliver more accurate and contextually appropriate support and self-service experiences.

Leveraging the state-of-the-art GPTQ quantization method, SearchUnify optimized the XGen-7B model for a low memory footprint and rapid response generation.

These are GPTQ 4-bit model files for [VMware's XGen 7B 8K Open Instruct](https://huggingface.co/VMware/xgen-7b-8k-open-instruct), produced by quantizing the model to 4 bits with GPTQ-for-LLaMa.
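For reference, a roughly equivalent quantization could be reproduced with [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). The sketch below is only an illustration, not the exact procedure used to build these files (which were quantized with GPTQ-for-LLaMa); the 4-bit / group-size-128 settings are inferred from the `gptq_model-4bit-128g` filename, and the calibration text is a placeholder.

```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "VMware/xgen-7b-8k-open-instruct"

# Settings inferred from the quantized filename (4-bit, group size 128); assumptions.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config, trust_remote_code=True)

# Placeholder calibration sample; real calibration should use representative text.
examples = [tokenizer("GPTQ calibrates the quantized weights on a small sample of text.")]

model.quantize(examples)
model.save_quantized("xgen-7b-8k-open-instruct-gptq", use_safetensors=False)
```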


# How to use this GPTQ model from Python code

First, make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

```
pip install auto-gptq

Second, install tiktoken, which is required by the tokenizer:

```
pip install tiktoken
```

You can then load the tokenizer and the quantized model, and generate a response:

```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

# The XGen tokenizer ships as custom code and depends on tiktoken.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          use_fast=False,
                                          trust_remote_code=True)

# Load the 4-bit GPTQ checkpoint (plain .bin, not safetensors) onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton)

# Note: check the prompt template is correct for this model.
prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.3, max_new_tokens=512)
print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")

```
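The `### Instruction:` / `### Response:` template appears to follow the Alpaca-style format used by VMware's open-instruct fine-tunes; as the comment above notes, verify it matches this model. As a convenience, generation can be wrapped in a small helper (hypothetical; it reuses the `model` and `tokenizer` loaded above and decodes only the newly generated tokens instead of splitting on `### Response:`):

```
def generate_response(instruction, max_new_tokens=512, temperature=0.3):
    # Wrap the raw instruction in the same template used at fine-tuning time.
    prompt = f"### Instruction: {instruction}\n### Response:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    output = model.generate(inputs=input_ids,
                            temperature=temperature,
                            max_new_tokens=max_new_tokens)
    # Decode only the tokens generated after the prompt.
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

print(generate_response("Explain the rules of field hockey to a novice."))
```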