---
language: en
tags:
- text-classification
- onnx
- int8
- roberta
- emotions
- multi-class-classification
- multi-label-classification
datasets:
- go_emotions
license: mit
inference: false
widget:
- text: Thank goodness ONNX is available, it is lots faster!
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version

- that has identical accuracy/metrics to the original Transformers model
- and the same model size (499MB)
- but is faster for inference than normal Transformers, particularly for smaller batch sizes
  - in my tests, about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime

### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the int8 quantized version

- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- and is faster for inference
  - about 2x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
  - which makes it circa 5x as fast as the full precision normal Transformers model (on the above-mentioned CPU, for a batch of 1)

### How to use

#### Using Optimum Library ONNX Classes

To follow.

#### Using ONNXRuntime

- Tokenization can be done beforehand with the `tokenizers` library,
- the token IDs and attention masks are then fed into ONNXRuntime as the dict of inputs it expects,
- and a sigmoid is applied to the model output (a numpy array of logits) afterwards to produce the per-label scores.

```python
from tokenizers import Tokenizer
import onnxruntime as ort
from os import cpu_count

import numpy as np  # only used for the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# optional - pad only to the longest sequence in the batch rather than a fixed length
# (assumes padding is already enabled in the tokenizer config). Without this, the model
# will run slower, especially for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # e.g. ["logits"]

# ONNXRuntime expects int64 numpy arrays for the transformer inputs
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

# run the model, then convert the logits to per-label probabilities with a sigmoid
model_output = model.run(output_names=output_names, input_feed=input_feed_dict)[0]
scores = sigmoid(model_output)
print(scores)
```

### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.
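Until that notebook is available, here is a minimal, untested sketch (continuing directly from the ONNXRuntime snippet above and reusing its `sentences` and `scores` variables) of how the sigmoid scores can be mapped to the 28 go_emotions label names. The label names are pulled from the original model's config via `transformers`, and the 0.5 threshold is purely illustrative rather than a tuned value.

```python
from transformers import AutoConfig  # only used here to fetch the label names

# id2label in the original model's config maps class indices to the 28 go_emotions label names
config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")
labels = [config.id2label[i] for i in range(len(config.id2label))]

threshold = 0.5  # illustrative only - per-label thresholds can be tuned for better metrics

# continuing from the ONNXRuntime example above: scores has shape (batch_size, 28)
for sentence, sentence_scores in zip(sentences, scores):
    predicted = [
        (label, float(score))
        for label, score in zip(labels, sentence_scores)
        if score >= threshold
    ]
    print(sentence, predicted)
```

Because go_emotions is multi-label, more than one label (or none at all) may clear the threshold for a given sentence.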