---
license: mit
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

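If you just want the `.onnx` files on disk (for example to pass a filepath to ONNX Runtime as in the example further down), the sketch below is one way to fetch them with `huggingface_hub`; the repo id shown is an assumption based on this model card.

```python
from huggingface_hub import hf_hub_download

# Assumption: this repository's id - substitute it if it differs.
model_path = hf_hub_download(
    repo_id="SamLowe/roberta-base-go_emotions-onnx",
    filename="onnx/model.onnx",  # or "onnx/model_quantized.onnx" for the INT8 version
)
print(model_path)  # local cached filepath, usable with ONNX Runtime below
```
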
### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version

- that has identical accuracy to the original transformers model
- and has the same model size (499MB)
- but is faster for inference than the normal transformers model, particularly for smaller batch sizes
  - in my tests, about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNX Runtime

### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the int8 quantized version

- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- and is faster still for inference
  - about 2x as fast as the full precision model above for a batch size of 1, on an 8-core 11th gen i7 CPU using ONNX Runtime
  - which makes it circa 5x as fast as the full precision normal transformers model (on the above-mentioned CPU, for a batch of 1; see the timing sketch below for one way to measure this on your own hardware)

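The timings above are of course machine- and workload-dependent. As a hedged sketch (not the exact benchmark behind the numbers above), the following is one way to compare batch-of-1 latency of the two ONNX files on your own hardware; the file paths are placeholders for your local copies.

```python
import time

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
tokens = tokenizer.encode("hello world")
feed = {
    "input_ids": np.array([tokens.ids], dtype=np.int64),
    "attention_mask": np.array([tokens.attention_mask], dtype=np.int64),
}

# Placeholder paths - point these at your local copies of the two files.
for path in ["onnx/model.onnx", "onnx/model_quantized.onnx"]:
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    session.run(None, feed)  # warm-up run, excluded from the timing
    start = time.perf_counter()
    for _ in range(100):
        session.run(None, feed)
    print(f"{path}: {(time.perf_counter() - start) * 10:.2f} ms per inference")
```
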
### How to use

#### Using Optimum Library ONNX Classes

To follow.

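In the meantime, as a minimal sketch: Optimum's ONNX Runtime classes plug straight into the standard `transformers` pipeline. The repo id and `file_name` arguments below are assumptions based on this model card rather than a tested recipe.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Assumption: this repository's id; file_name selects which ONNX file to load.
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions-onnx",
    file_name="onnx/model.onnx",  # or "onnx/model_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions-onnx")

# go_emotions is multi-label, so apply a sigmoid and return scores for all labels
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    function_to_apply="sigmoid",
)
print(classifier(["hello world"]))
```
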
#### Using ONNX Runtime

- Tokenization can be done beforehand with the `tokenizers` library,
- the resulting tokens then fed into ONNX Runtime as the input-feed dict it expects,
- and then only a postprocessing sigmoid is needed on the model output (which comes back as a numpy array) to turn the logits into per-label probabilities.

```python
from os import cpu_count

import numpy as np  # used for the input arrays and the postprocessing sigmoid
import onnxruntime as ort
from tokenizers import Tokenizer

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# Optional: pad only to the longest sequence in the batch, not to a fixed length.
# Without this the model will run slower, especially for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # e.g. ["logits"]

# ONNX Runtime expects int64 numpy arrays rather than plain Python lists
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

model_output = model.run(output_names=output_names, input_feed=input_feed_dict)[0]

probabilities = sigmoid(model_output)  # one score per go_emotions label, per sentence
print(probabilities)
```
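
Each of those probabilities corresponds to one of the 28 go_emotions labels. As a sketch of one way to attach the label names, using the label mapping from the original model's config:

```python
from transformers import AutoConfig

# The label mapping comes from the original transformers model's config
config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")
id2label = config.id2label  # e.g. {0: "admiration", 1: "amusement", ...}

for scores in probabilities:  # `probabilities` from the snippet above
    labelled = {id2label[i]: float(score) for i, score in enumerate(scores)}
    print(sorted(labelled.items(), key=lambda kv: kv[1], reverse=True)[:3])  # top 3
```
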
### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.