This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale per tensor maps between the FP8 and original-precision representations of the quantized weights and activations.
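
To make the scheme concrete, the sketch below shows what symmetric per-tensor FP8 (E4M3) quantization looks like in plain PyTorch: a single scale derived from the tensor's maximum absolute value, with no zero point. This is a minimal illustration of the idea, not AutoFP8's implementation; the helper names are hypothetical.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for the E4M3 format

def quantize_fp8_per_tensor(x: torch.Tensor):
    # One linear scale for the whole tensor (symmetric: no zero point).
    scale = x.abs().max().float() / FP8_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return x_fp8, scale

def dequantize_fp8_per_tensor(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The same single scale maps the FP8 values back to high precision.
    return x_fp8.float() * scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = quantize_fp8_per_tensor(w)
w_hat = dequantize_fp8_per_tensor(w_fp8, w_scale)
print(w_fp8.dtype, w_scale.item(), (w - w_hat).abs().max().item())
```
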
[AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization with 512 sequences of UltraChat.
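
For context, a static FP8 calibration run with AutoFP8 looks roughly like the snippet below, following the example in the AutoFP8 README. The base-model and dataset identifiers (`mistralai/Mistral-Nemo-Instruct-2407`, `HuggingFaceH4/ultrachat_200k`) are assumptions for illustration; see the Creation section below for the exact recipe used for this checkpoint.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "mistralai/Mistral-Nemo-Instruct-2407"  # assumed base model
quantized_model_dir = "Mistral-Nemo-Instruct-2407-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
tokenizer.pad_token = tokenizer.eos_token

# 512 UltraChat sequences as calibration data (dataset id assumed here)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(s["messages"], tokenize=False) for s in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static activation scales are calibrated from the examples above.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```
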
## Deployment

### Use with vLLM

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Mistral-Nemo-Instruct-2407-FP8"

sampling_params = SamplingParams(temperature=0.3, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False)

llm = LLM(model=model_id, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
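
As a sketch of that serving path, an OpenAI-compatible server can be started with vLLM's entrypoint and queried with any OpenAI client; the port and `api_key` value below are illustrative defaults, not values taken from this card.

```python
# Assumes a server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.3,
)
print(completion.choices[0].message.content)
```
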
## Creation