alexmarques committed · Commit 8e4a37a · Parent(s): fa37030

Update README.md
README.md CHANGED
@@ -39,8 +39,6 @@ GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.
 
 ## Deployment
 
-### Use with vLLM
-
 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 
 ```python
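The vLLM example opened by the context line above is unchanged by this commit, so the diff elides it (old lines 45-70). For orientation, here is a minimal sketch of what such a snippet typically looks like, assuming the standard vLLM `LLM`/`SamplingParams` API and this repository's model id; the next hunk header confirms only that the elided block ends with `print(generated_text)`:

```python
# Hedged sketch only; the README's actual vLLM example is elided from this diff.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"

# Build a chat-formatted prompt with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Sampling settings matching the ones used in the removed Transformers example.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

llm = LLM(model=model_id)
outputs = llm.generate([prompt], sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```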
@@ -71,50 +69,6 @@ print(generated_text)
 
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
-### Use with transformers
-
-The following example shows how the model can be deployed with Transformers using the `generate()` function.
-
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype="auto",
-    device_map="auto",
-)
-
-messages = [
-    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
-    {"role": "user", "content": "Who are you?"},
-]
-
-input_ids = tokenizer.apply_chat_template(
-    messages,
-    add_generation_prompt=True,
-    return_tensors="pt"
-).to(model.device)
-
-terminators = [
-    tokenizer.eos_token_id,
-    tokenizer.convert_tokens_to_ids("<|eot_id|>")
-]
-
-outputs = model.generate(
-    input_ids,
-    max_new_tokens=256,
-    eos_token_id=terminators,
-    do_sample=True,
-    temperature=0.6,
-    top_p=0.9,
-)
-response = outputs[0][input_ids.shape[-1]:]
-print(tokenizer.decode(response, skip_special_tokens=True))
-```
 
 ## Creation
 
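The serving note above is left as a pointer in the README. A minimal sketch of that route, assuming vLLM's stock OpenAI-compatible server and the `openai` Python client, neither of which is spelled out in this diff:

```python
# Hedged sketch, not part of the README being diffed.
# First start the server with vLLM's standard entrypoint:
#   vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key is a placeholder unless the server enforces one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    max_tokens=256,
    temperature=0.6,
)
print(completion.choices[0].message.content)
```

With the Transformers path removed by this commit, offline `LLM.generate` and this server route are the two documented ways to run the model.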
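The first hunk header quotes the Creation section's note that GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens. A minimal sketch of how those parameters typically map onto a one-shot GPTQ run with `llm-compressor`; the recipe details and the calibration dataset below are assumptions, since the actual creation script sits outside this diff:

```python
# Hedged sketch; the README's actual creation recipe is not shown in this diff.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",          # 8-bit weights and activations, per the model name
    ignore=["lm_head"],     # assumption: output head left unquantized
    dampening_frac=0.01,    # the 1% damping factor quoted in the hunk header
)

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dataset="open_platypus",         # placeholder; the README says random tokens were used
    recipe=recipe,
    max_seq_length=8192,             # sequence length from the quoted note
    num_calibration_samples=256,     # number of sequences from the quoted note
)
```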