---
language: en
tags:
- text-classification
- onnx
- int8
- roberta
- emotions
- multi-class-classification
- multi-label-classification
- optimum
datasets:
- go_emotions
license: mit
inference: false
widget:
- text: Thank goodness ONNX is available, it is lots faster!
---
This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).
### Full precision ONNX version
`onnx/model.onnx` is the full precision ONNX version
- that has identical accuracy/metrics to the original Transformers model
- and has the same model size (499MB)
- is faster in inference than normal Transformers, particularly for smaller batch sizes
- in my tests, about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime
#### Metrics
Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label:
- Accuracy: 0.474
- Precision: 0.575
- Recall: 0.396
- F1: 0.450
See more details in the SamLowe/roberta-base-go_emotions model card for the increases possible through selecting label-specific thresholds to maximise F1 scores, or another metric.
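For reference, a minimal sketch of how fixed-threshold metrics like these can be computed with scikit-learn, assuming you already have the model's sigmoid scores and the multi-hot ground-truth labels for the go_emotions test split (the arrays below are placeholders, and macro averaging over the 28 labels is an assumption about how the figures above were produced):
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder arrays - in practice these are the model's sigmoid scores and the
# multi-hot ground truth for the go_emotions test split, shape (n_examples, 28).
scores = np.random.rand(100, 28)
y_true = (np.random.rand(100, 28) > 0.9).astype(int)

y_pred = (scores >= 0.5).astype(int)  # fixed threshold of 0.5 for every label

# Subset accuracy: an example counts as correct only if all 28 labels match.
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall:", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```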
### Quantized (INT8) ONNX version
`onnx/model_quantized.onnx` is the int8 quantized version (see the quantization sketch after this list for how such a file can be produced)
- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- is faster in inference than both the full precision ONNX above, and the normal Transformers model
- about 2x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
- which makes it circa 5x as fast as the full precision normal Transformers model (on the above mentioned CPU, for a batch of 1)
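This card doesn't state exactly how the quantized file was produced. For a file like `model_quantized.onnx`, the usual route is Optimum's dynamic INT8 quantization, so a minimal sketch might look like the following (the save paths and the `avx512_vnni` config are assumptions; pick a config that matches your CPU):
```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the original Transformers model to ONNX first
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions", export=True
)
model.save_pretrained("onnx_model")  # assumed local path

# Dynamic (weights-only) INT8 quantization - no calibration data needed
quantizer = ORTQuantizer.from_pretrained("onnx_model")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx_model_quantized", quantization_config=qconfig)
```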
#### Metrics for Quantized (INT8) Model
Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label:
- Accuracy: 0.475
- Precision: 0.582
- Recall: 0.398
- F1: 0.447
Note how the metrics are almost identical to the full precision metrics above.
See more details in the SamLowe/roberta-base-go_emotions model card for the increases possible through selecting label-specific thresholds to maximise F1 scores, or another metric.
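To make "label-specific thresholds" concrete, a simple per-label grid search over a validation split might look like the sketch below (the arrays are placeholders, not the evaluation code actually used for this card):
```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder validation-split arrays, shape (n_examples, 28), as in the sketch above
scores = np.random.rand(200, 28)
y_true = (np.random.rand(200, 28) > 0.9).astype(int)

best_thresholds = np.full(scores.shape[1], 0.5)
for i in range(scores.shape[1]):
    best_f1 = -1.0
    for t in np.arange(0.05, 0.96, 0.05):  # coarse grid of candidate thresholds
        f1 = f1_score(y_true[:, i], (scores[:, i] >= t).astype(int), zero_division=0)
        if f1 > best_f1:
            best_f1, best_thresholds[i] = f1, t

# Broadcast the per-label thresholds when binarizing new scores
y_pred = (scores >= best_thresholds).astype(int)
```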
### How to use
#### Using Optimum Library ONNX Classes
The Optimum library has equivalents (prefixed `ORT`) of the main Transformers classes, so these models can be used with the familiar constructs. The only extra property needed is `file_name` at model creation, which in the example below selects the quantized (INT8) model.
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

sentences = ["ONNX is seriously fast for small batches. Impressive"]
model_id = "SamLowe/roberta-base-go_emotions-onnx"
file_name = "onnx/model_quantized.onnx"
model = ORTModelForSequenceClassification.from_pretrained(model_id, file_name=file_name)
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_classifier = pipeline(
task="text-classification",
model=model,
tokenizer=tokenizer,
top_k=None,
function_to_apply="sigmoid", # optional as is the default for the task
)
model_outputs = onnx_classifier(sentences)
# gives a list of outputs, each a list of dicts (one per label)
print(model_outputs)
# E.g.
# [[{'label': 'admiration', 'score': 0.9203393459320068},
# {'label': 'approval', 'score': 0.0560273639857769},
# {'label': 'neutral', 'score': 0.04265536740422249},
# {'label': 'gratitude', 'score': 0.015126707963645458},
# ...
```
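Since `top_k=None` returns scores for all 28 labels, a short continuation shows how the pipeline output above can be reduced to multi-label predictions using the fixed 0.5 threshold from the metrics sections (a sketch; the threshold choice is up to you):
```python
# Continues from the block above, using its model_outputs
threshold = 0.5
for output in model_outputs:
    predicted = [d["label"] for d in output if d["score"] >= threshold]
    print(predicted)  # e.g. ['admiration'] for the example sentence
```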
#### Using ONNXRuntime
- Tokenization can be done beforehand with the `tokenizers` library,
- the token IDs and attention mask are then fed into ONNXRuntime as the dict of inputs it expects,
- and only a sigmoid is needed afterward on the model output (which comes back as a numpy array of logits) to produce the per-label scores.
```python
from tokenizers import Tokenizer
import onnxruntime as ort
from os import cpu_count
import numpy as np  # used for the input tensors and the postprocessing sigmoid
sentences = ["hello world"] # for example a batch of 1
# labels as (ordered) list - from the go_emotions dataset
labels = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
# Optional - set pad to only pad to longest in batch, not a fixed length.
# (without this, the model will run slower, esp for shorter input strings)
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)
tokens_obj = tokenizer.encode_batch(sentences)
def load_onnx_model(model_filepath):
_options = ort.SessionOptions()
_options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
_providers = ["CPUExecutionProvider"] # could use ort.get_available_providers()
return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)
model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name] # E.g. ["logits"]
input_feed_dict = {
    # ONNXRuntime expects numpy arrays; the model's inputs are int64 tensors
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}
logits = model.run(output_names=output_names, input_feed=input_feed_dict)[0]
# produces a numpy array, one row per input item, one col per label
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
# Post-processing: maps each logit to an independent score in the range (0, 1).
# The Transformers pipeline does this automatically; with ORT it is a manual step.
model_outputs = sigmoid(logits)
# for example, just to show the top result per input item
for probas in model_outputs:
top_result_index = np.argmax(probas)
print(labels[top_result_index], "with score:", probas[top_result_index])
```
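The speed figures quoted earlier (roughly 2x to 3x for full precision ONNX over Transformers, and around 2x again for INT8) were measured on one specific CPU, so a rough timing sketch like the one below can help check them on your own hardware (a continuation of the block above, not the exact benchmark used for this card):
```python
import time

def mean_latency(session, input_feed, runs=100):
    session.run(None, input_feed)  # warm-up so one-off setup cost doesn't skew timing
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, input_feed)
    return (time.perf_counter() - start) / runs

# Compare model.onnx vs model_quantized.onnx by loading each and timing it
print(f"mean latency per batch: {mean_latency(model, input_feed_dict) * 1000:.1f} ms")
```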
### Example notebook: showing usage, accuracy & performance
Notebook with more details to follow. |