ONNX Model Inference Example

#6
by supreethrao - opened

Hi,
It would be great to have an example that uses the ONNX version of the model, given that the sentence-transformers version requires some transformation of the output of model.encode() to get the vectors.

Thanks!

Nomic AI org

Can sentence-transformers be used with ONNX? I am not aware of that. If you want to use the ONNX model, you can use something like Triton, but the last time I tried it, it was a bit painful to set up: https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/ONNX/README.md

zpn changed discussion status to closed

I was able to use the ONNX model with Huggingface's Optimum library, like so:

  1. Install all required dependencies for loading the model with Huggingface Transformers, e.g. transformers, torch etc.
  2. Install Huggingface Optimum: pip install optimum[onnxruntime-gpu] (use this extra if you're running the model on a GPU)
  3. Install sentence-transformers: pip install sentence-transformers
  4. Load the tokenizer and model and perform inference with the model, with mean pooling of embeddings and normalization (skip these if you don't need them):
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    model_max_length=8192
)
model = ORTModelForFeatureExtraction.from_pretrained(
    "nomic-ai/nomic-embed-text-v1",
    file_name="onnx/model.onnx",
    provider="CUDAExecutionProvider", # change this if you want to use a different backend
    trust_remote_code=True,
    rotary_scaling_factor=2
)

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

texts = ["text 1 ....", "text 2 ...."]
inputs = tokenizer(  # module-level tokenizer defined above; there is no `self` outside a class
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=8192
)

inputs = inputs.to(torch.device('cuda'))
with torch.no_grad():
    model_output = model(**inputs)
embeddings = mean_pooling(model_output, inputs['attention_mask'])
normalized_embeddings = F.normalize(embeddings, p=2, dim=1).cpu().numpy().tolist()
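
To make the arithmetic of the mean_pooling and F.normalize steps above concrete, here is a minimal pure-Python sketch on a toy input. The helper names `mean_pool` and `l2_normalize` and the toy values are made up for illustration; they are not part of Optimum or the model repo. It handles a single sequence, whereas the torch version above works on a whole batch.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 = real token, 0 = padding
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (what F.normalize(p=2) does per row)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Toy input: three token positions with 2-dim embeddings; the last is padding.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)  # padding row is ignored
unit = l2_normalize(pooled)       # vector of length 1
print(pooled, unit)
```

The padded row never contributes to the average, which is why the attention mask has to be passed to the pooling step along with the model output.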
