---
license: mit
---

This model is the ONNX version of https://huggingface.co/SamLowe/roberta-base-go_emotions.

Full precision ONNX version

onnx/model.onnx is the full precision ONNX version

  • that has identical performance to the original transformers model
  • and has the same model size (499MB)
  • is faster for inference than the normal Transformers model, particularly for smaller batch sizes
    • in my tests about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime (a rough timing sketch follows this list)
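
A rough way to check the batch-size-1 latency on your own hardware is sketched below. It assumes the Optimum library is installed and that this repo keeps the full precision file at onnx/model.onnx; the example sentence and repetition count are arbitrary.

```python
import time

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

original_id = "SamLowe/roberta-base-go_emotions"   # normal Transformers model
onnx_id = "SamLowe/roberta-base-go_emotions-onnx"  # this repo

tokenizer = AutoTokenizer.from_pretrained(original_id)

# plain PyTorch/Transformers pipeline
pt_pipe = pipeline("text-classification", model=original_id, tokenizer=tokenizer, top_k=None)

# full precision ONNX model from this repo, wrapped in the same pipeline API
ort_model = ORTModelForSequenceClassification.from_pretrained(
    onnx_id, subfolder="onnx", file_name="model.onnx"
)
ort_pipe = pipeline("text-classification", model=ort_model, tokenizer=tokenizer, top_k=None)

def mean_latency(pipe, text="I am so happy about this!", n=50):
    pipe(text)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(n):
        pipe(text)
    return (time.perf_counter() - start) / n

print(f"transformers: {mean_latency(pt_pipe):.4f} s per sentence")
print(f"onnx:         {mean_latency(ort_pipe):.4f} s per sentence")
```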

Quantized (INT8) ONNX version

onnx/model_quantized.onnx is the int8 quantized version (a sketch of how a comparable file can be produced follows the list below)

  • that is one quarter the size (125MB) of the full precision model (above)
  • but delivers almost all of the accuracy
  • and is also faster for inference
    • about 2x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
    • which makes it circa 5x as fast as the full precision normal Transformers model (on the above-mentioned CPU, for a batch of 1)
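
An int8 quantized ONNX file like this can be produced with the Optimum library's dynamic quantization. The sketch below shows one way to create a comparable file from a local directory containing a full precision model.onnx; the paths are placeholders, and this is not necessarily how onnx/model_quantized.onnx itself was generated.

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# directory containing the full precision model.onnx (placeholder path)
quantizer = ORTQuantizer.from_pretrained("path_to_dir_with_full_precision_onnx", file_name="model.onnx")

# dynamic int8 quantization config, here targeting AVX512-VNNI capable CPUs
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# writes model_quantized.onnx into save_dir
quantizer.quantize(save_dir="path_to_output_dir", quantization_config=dqconfig)
```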

How to use

Using Optimum Library ONNX Classes

To follow.
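
Until then, here is a minimal sketch of what usage via Optimum could look like. The subfolder/file_name arguments are assumptions about this repo's layout, and the tokenizer is loaded from the original (non-ONNX) model repo.

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# tokenizer comes from the original model repo
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# pick model.onnx or model_quantized.onnx from the onnx/ subfolder of this repo
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions-onnx",
    subfolder="onnx",
    file_name="model_quantized.onnx",
)

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,                   # return scores for all labels, not just the top one
    function_to_apply="sigmoid",  # multi-label model, so sigmoid rather than softmax
)

print(classifier(["I am not having a great day"]))
```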

Using ONNXRuntime

  • Tokenization can be done beforehand with the tokenizers library,
  • the encoded batch is then fed into ONNXRuntime as the dict of named inputs it expects,
  • and a sigmoid is applied to the model output (which comes back as a numpy array of logits) to produce the per-label probabilities.
```python
from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used to build the int64 input tensors and for the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# optional - pad only to the longest sequence in the batch rather than a fixed length.
# Without this, inference is slower, especially for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # E.g. ["logits"]

# the exported model's inputs are int64 tensors, so build numpy arrays explicitly
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
  return 1.0 / (1.0 + np.exp(-_outputs))

model_output = model.run(output_names=output_names, input_feed=input_feed_dict)[0]

probabilities = sigmoid(model_output)  # one score per go_emotions label for each input sentence
print(probabilities)
```
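
The sigmoid output is one score per label, ordered as in the model config's id2label mapping. If you want the scores keyed by label name, one option (assuming the original model repo's config carries the go_emotions labels) is:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")
labels = [config.id2label[i] for i in range(len(config.id2label))]

for scores in probabilities:
    print({label: float(score) for label, score in zip(labels, scores)})
```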

Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.