---
language: en
tags:
  - text-classification
  - onnx
  - int8
  - roberta
  - emotions
  - multi-class-classification
  - multi-label-classification
datasets:
  - go_emotions
license: mit
inference: false
widget:
  - text: Thank goodness ONNX is available, it is lots faster!
---

This model is the ONNX version of https://huggingface.co/SamLowe/roberta-base-go_emotions.

Full precision ONNX version

onnx/model.onnx is the full precision ONNX version

  • that has identical accuracy/metrics to the original Transformers model
  • and has the same model size (499MB)
  • is faster at inference than the normal Transformers model, particularly for smaller batch sizes
    • in my tests, about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime
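
The card doesn't spell out how onnx/model.onnx was produced, but for context, here is a minimal sketch of one common route: exporting the original Transformers model with the Optimum library (the output directory name is just an example, and this is not necessarily the exact recipe used for this repo):

# Illustrative export sketch (not necessarily how onnx/model.onnx here was made):
# convert the original PyTorch model to ONNX via Optimum, then save it locally.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "SamLowe/roberta-base-go_emotions"

ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ort_model.save_pretrained("./exported_onnx")   # writes model.onnx plus config
tokenizer.save_pretrained("./exported_onnx")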

Quantized (INT8) ONNX version

onnx/model_quantized.onnx is the int8 quantized version

  • that is one quarter the size (125MB) of the full precision model (above)
  • but delivers almost all of the accuracy of the full precision model
  • is faster for inference
    • about 2x as fast as the full precision ONNX model above for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime
    • which makes it circa 5x as fast as the full precision normal Transformers model (on the above-mentioned CPU, for a batch of 1)
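
For context, a minimal sketch of how an INT8 file like this can be produced with ONNX Runtime's dynamic quantization (an illustration of the general technique, not necessarily the exact recipe behind onnx/model_quantized.onnx):

# Illustrative only: produce an int8 model with ONNX Runtime dynamic quantization.
# The exact settings behind onnx/model_quantized.onnx are not documented here.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="onnx/model.onnx",              # full precision source model
    model_output="onnx/model_quantized.onnx",   # int8 weights, roughly 1/4 the size
    weight_type=QuantType.QInt8,
)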

How to use

Using Optimum Library ONNX Classes

To follow.
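
In the meantime, here is a sketch of what this would typically look like using Optimum's ORTModelForSequenceClassification with the Transformers pipeline (the repo id and file_name argument are assumptions about this repo's layout, not confirmed above):

# Sketch only (untested here): assumes the quantized model in this repo can be
# loaded via Optimum by pointing file_name at onnx/model_quantized.onnx.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "SamLowe/roberta-base-go_emotions-onnx"

model = ORTModelForSequenceClassification.from_pretrained(
    model_id, file_name="onnx/model_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,                   # return scores for all 28 labels
    function_to_apply="sigmoid",  # multi-label, so sigmoid rather than softmax
)

print(classifier(["Thank goodness ONNX is available, it is lots faster!"]))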

Using ONNXRuntime

  • Tokenization can be done beforehand with the tokenizers library,
  • the token ids and attention mask are then fed into ONNXRuntime as the dict of inputs it expects,
  • and finally a sigmoid is applied to the model output (which comes back as a numpy array of logits) to produce the per-label scores.

from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used for the input arrays and the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

# labels as (ordered) list - from the go_emotions dataset
labels = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# Optional - set padding to only pad to the longest item in the batch, not a fixed length.
# (without this, the model will run slower, especially for shorter input strings)
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # E.g. ["logits"]

# the exported model takes int64 tensors, so build numpy arrays explicitly
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

logits = model.run(output_names=output_names, input_feed=input_feed_dict)[0]
# produces a numpy array, one row per input item, one col per label

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

# Post-processing. Gets the scores per label, each in the range 0 to 1.
# Done automatically by Transformers' pipeline, but must be done manually with ORT.
model_outputs = sigmoid(logits)

# for example, just to show the top result per input item
for probas in model_outputs:
    top_result_index = np.argmax(probas)
    print(labels[top_result_index], "with score:", probas[top_result_index])
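
Since go_emotions is a multi-label dataset, you may want every label whose score clears a threshold rather than just the top one. Continuing from the example above, a short sketch (the 0.5 threshold is an arbitrary example, not a tuned value):

# Multi-label selection (continuing from the example above): keep every label
# whose score clears a threshold. The 0.5 value is illustrative, not tuned.
threshold = 0.5
for sentence, probas in zip(sentences, model_outputs):
    selected = [(labels[i], float(p)) for i, p in enumerate(probas) if p >= threshold]
    print(sentence, "->", selected)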

Example notebook showing usage, accuracy & performance

Notebook with more details to follow.