---
language: en
tags:
  - text-classification
  - onnx
  - int8
  - roberta
  - emotions
  - multi-class-classification
  - multi-label-classification
datasets:
  - go_emotions
license: mit
inference: false
widget:
  - text: Thank goodness ONNX is available, it is lots faster!
---

This model is the ONNX version of https://huggingface.co/SamLowe/roberta-base-go_emotions.

Full precision ONNX version

onnx/model.onnx is the full precision ONNX version

  • that has identical accuracy/metrics to the original Transformers model
  • and has the same model size (499MB)
  • is faster at inference than the normal Transformers model, particularly for smaller batch sizes
    • in my tests, about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime
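
The card doesn't spell out how onnx/model.onnx was produced, but for context, here is a minimal sketch of one common route: exporting the original Transformers model with the Optimum library (the output directory name is just an example, and this is not necessarily the exact recipe used for this repo):

# Illustrative export sketch (not necessarily how onnx/model.onnx here was made):
# convert the original PyTorch model to ONNX via Optimum, then save it locally.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "SamLowe/roberta-base-go_emotions"

ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ort_model.save_pretrained("./exported_onnx")   # writes model.onnx plus config
tokenizer.save_pretrained("./exported_onnx")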

Quantized (INT8) ONNX version

onnx/model_quantized.onnx is the int8 quantized version

  • that is one quarter the size (125MB) of the full precision model (above)
  • but delivers almost all of the accuracy of the full precision model
  • is faster for inference
    • about 2x as fast as the full precision ONNX model above for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime
    • which makes it circa 5x as fast as the full precision normal Transformers model (on the above-mentioned CPU, for a batch of 1)
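
For context, a minimal sketch of how an INT8 file like this can be produced with ONNX Runtime's dynamic quantization (an illustration of the general technique, not necessarily the exact recipe behind onnx/model_quantized.onnx):

# Illustrative only: produce an int8 model with ONNX Runtime dynamic quantization.
# The exact settings behind onnx/model_quantized.onnx are not documented here.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="onnx/model.onnx",              # full precision source model
    model_output="onnx/model_quantized.onnx",   # int8 weights, roughly 1/4 the size
    weight_type=QuantType.QInt8,
)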

How to use

Using Optimum Library ONNX Classes

To follow.
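
In the meantime, here is a sketch of what this would typically look like using Optimum's ORTModelForSequenceClassification with the Transformers pipeline (the repo id and file_name argument are assumptions about this repo's layout, not confirmed above):

# Sketch only (untested here): assumes the quantized model in this repo can be
# loaded via Optimum by pointing file_name at onnx/model_quantized.onnx.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "SamLowe/roberta-base-go_emotions-onnx"

model = ORTModelForSequenceClassification.from_pretrained(
    model_id, file_name="onnx/model_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,                   # return scores for all 28 labels
    function_to_apply="sigmoid",  # multi-label, so sigmoid rather than softmax
)

print(classifier(["Thank goodness ONNX is available, it is lots faster!"]))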

Using ONNXRuntime

  • Tokenization can be done beforehand with the tokenizers library,
  • the token ids and attention mask are then fed into ONNXRuntime as the dict of inputs it expects,
  • and finally a sigmoid is applied to the model output (which comes back as a numpy array of logits) to produce the per-label scores.

from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used for the input arrays and the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

# labels as (ordered) list - from the go_emotions dataset
labels = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# Optional - set padding to only pad to the longest item in the batch, not a fixed length.
# (without this, the model will run slower, especially for shorter input strings)
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # E.g. ["logits"]

# the exported model takes int64 tensors, so build numpy arrays explicitly
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

logits = model.run(output_names=output_names, input_feed=input_feed_dict)[0]
# produces a numpy array, one row per input item, one col per label

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

# Post-processing. Gets the scores per label, each in the range 0 to 1.
# Done automatically by Transformers' pipeline, but must be done manually with ORT.
model_outputs = sigmoid(logits)

# for example, just to show the top result per input item
for probas in model_outputs:
    top_result_index = np.argmax(probas)
    print(labels[top_result_index], "with score:", probas[top_result_index])
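
Since go_emotions is a multi-label dataset, you may want every label whose score clears a threshold rather than just the top one. Continuing from the example above, a short sketch (the 0.5 threshold is an arbitrary example, not a tuned value):

# Multi-label selection (continuing from the example above): keep every label
# whose score clears a threshold. The 0.5 value is illustrative, not tuned.
threshold = 0.5
for sentence, probas in zip(sentences, model_outputs):
    selected = [(labels[i], float(p)) for i, p in enumerate(probas) if p >= threshold]
    print(sentence, "->", selected)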

Example notebook showing usage, accuracy & performance

Notebook with more details to follow.