petermcaughan committed
Commit 646bb29
1 Parent(s): f8182d3
Update README.md
README.md
CHANGED
@@ -28,7 +28,7 @@ See the [usage instructions](#usage-example) for how to inference this model wit
 
 ## Performance Comparison
 
-#### Latency for
+#### Latency for token generation
 
 Below is average latency of generating a token using a prompt of varying size using NVIDIA A100-SXM4-80GB GPU:
 
@@ -67,13 +67,13 @@ from transformers import AutoConfig, AutoTokenizer
 sess = InferenceSession("Mistral-7B-v0.1.onnx", providers = ["CUDAExecutionProvider"])
 config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
 
-
+model = ORTModelForCausalLM(sess, config, use_cache = True, use_io_binding = True)
 
 tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
 
 inputs = tokenizer("Instruct: What is a fermi paradox?\nOutput:", return_tensors="pt")
 
-outputs =
+outputs = model.generate(**inputs)
 
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
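For readers who want the updated snippet end to end, here is a minimal sketch assembling the lines touched by this commit. Only the `transformers` import appears in the hunk header; the `onnxruntime` and `optimum.onnxruntime` import lines, and the assumption that `Mistral-7B-v0.1.onnx` sits in the working directory, are added here for self-containment and are not part of the diff itself.

```python
# Minimal sketch of the updated usage example from this commit.
# The onnxruntime and optimum imports below are assumptions; only the
# transformers import is visible in the hunk header above.
from onnxruntime import InferenceSession
from optimum.onnxruntime import ORTModelForCausalLM  # assumed source of ORTModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

# ONNX Runtime session on GPU for the exported Mistral model
sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Wrap the session so it exposes generate() (first line added in this commit)
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tokenizer("Instruct: What is a fermi paradox?\nOutput:", return_tensors="pt")

# Generate and decode (second line added in this commit)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The two added lines belong together: a raw `InferenceSession` has no generation loop, so wrapping it in `ORTModelForCausalLM` is what supplies the `generate()` call used a few lines later.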