Onnx model doesn't produce embeddings close enough to SentenceTransformer version

#67 opened by luciancap001

I'm attempting to use the Onnx version of the model to speed up inference, however I noticed that the values output by the model after applying the pooling & normalization do not match those produced by the SentenceTransformer version. They also fail to fall within the expected tolerance of the AutoModel outputs when I execute each step manually. Specifically, I'm referencing this PyTorch tutorial (https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html), which states the outputs should agree within atol = 1e-5 & rtol = 1e-3 under torch.isclose(). Is there any explanation for this discrepancy, & how much will this affect results downstream?
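(For reference, if I understand the docs correctly that tolerance check is elementwise, |a - b| <= atol + rtol * |b|. A tiny made-up illustration:)

```python
import torch

a = torch.tensor([1.00000, 2.0])
b = torch.tensor([1.00002, 2.0])
# elementwise check: |a - b| <= atol + rtol * |b|
print(torch.isclose(a, b, rtol=1e-3, atol=1e-5))  # tensor([True, True])
```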

Do you have a script to test the comparison?

It looks like it might be more than just the Onnx model being off: even using the AutoModel class produces different results compared to the SentenceTransformer model's output. Thoughts?

```python
import onnxruntime

import numpy as np
import torch
import torch.nn.functional as F

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from sklearn.datasets import fetch_20newsgroups

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def main():
    
    #Obtain a list of strings to test the model
    docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data'][:100]

    #Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    ####################################################################################################
    # SentenceTransformer Model
    ####################################################################################################

    #Load the model using SentenceTransformer & embed the docs
    sent_trans_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    sent_trans_embeds = sent_trans_model.encode(docs)

    ####################################################################################################
    # Onnx Model
    ####################################################################################################

    # Tokenize sentences
    encoded_input = dict(tokenizer(docs, 
                                padding = True, 
                                truncation = True,
                                return_token_type_ids= False, 
                                return_tensors = 'pt'))

    #Create the Onnxruntime and do the forward pass with it
    ort_session = onnxruntime.InferenceSession("model.onnx", providers = ["CPUExecutionProvider"])
    ort_outs = ort_session.run(None, {k: v.cpu().numpy() for k, v in encoded_input.items()})
    ort_outs = [torch.tensor(i) for i in ort_outs]

    #Perform pooling & normalization on Onnx output
    onnx_embeds = F.normalize(mean_pooling(ort_outs, encoded_input['attention_mask']), p = 2, dim = 1).cpu().numpy()

    print("Onnx model embeddings close to SentenceTransformer embeddings: ", np.allclose(sent_trans_embeds, onnx_embeds, rtol=1e-03, atol=1e-05))

    ####################################################################################################
    # AutoModel
    ####################################################################################################

    #Load the model using the AutoModel class
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    #Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    #Perform pooling & normalization on AutoModel output
    auto_embeds = F.normalize(mean_pooling(model_output, encoded_input['attention_mask']), p = 2, dim = 1).cpu().numpy()

    print("AutoModel embeddings close to SentenceTransformer embeddings: ", np.allclose(sent_trans_embeds, auto_embeds, rtol=1e-03, atol=1e-05))

if __name__ == "__main__":
    main()
```

So it turns out the issue apparently was with the tokenizer all along. For some reason the tokenizer loaded with the AutoTokenizer class uses a max_length of 512, but the model expects/was trained with 256 (the SentenceTransformer wrapper caps max_seq_length at 256 for this model). Setting this argument when calling the tokenizer makes the outputs pass the np.allclose() test.

```python
encoded_input = tokenizer(docs,
                          padding = True,
                          truncation = True,
                          return_token_type_ids = False,
                          return_tensors = 'pt',
                          max_length = 256)
```
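If it helps anyone verify, the two configured lengths can be read directly from each library (these attribute names are the real ones; the printed values are what I see for this model):

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
print(AutoTokenizer.from_pretrained(name).model_max_length)  # 512
print(SentenceTransformer(name).max_seq_length)              # 256
```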

Another update: running the Onnx model & passing "CUDAExecutionProvider" instead of "CPUExecutionProvider" when initializing the InferenceSession fails the test. Not sure if that's the result of an issue with the Onnx model itself or with the Onnx runtime.
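A rough sketch for quantifying the CPU vs CUDA gap (the model path and tokenization mirror the script above; note some float32 drift between CPU and GPU kernels is expected, so only a large max-abs-diff would point at a real problem):

```python
import numpy as np
import onnxruntime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = dict(tokenizer(["a test sentence"], padding=True, truncation=True,
                        max_length=256, return_token_type_ids=False,
                        return_tensors="np"))

# Run the same model on both providers and compare the raw token embeddings
cpu_sess = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
gpu_sess = onnxruntime.InferenceSession("model.onnx",
                                        providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

cpu_out = cpu_sess.run(None, inputs)[0]
gpu_out = gpu_sess.run(None, inputs)[0]
print("max abs diff:", np.abs(cpu_out - gpu_out).max())
```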

I was having the same problem, thank you so much for this!

@rcland12 "I was having the same problem, thank you so much for this!" Did you have a problem with CUDAExecutionProvider?

What GPUs do you use for CUDAExecutionProvider?

Can you try the following regarding Onnx and CUDAExecutionProvider? https://medium.com/@transformergpt/how-to-convert-sentence-transformer-pytorch-models-to-onnx-with-the-right-pooling-method-61b1c83515d2
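For reference, a minimal export sketch along those lines, assuming the transformer backbone is exported alone and pooling & normalization are applied outside the graph as in the script above (the input/output names and opset are my choices, not taken from that article):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
# return_dict=False so the exporter sees plain tuple outputs
model = AutoModel.from_pretrained(name, return_dict=False).eval()

dummy = tokenizer(["an example sentence"], return_tensors="pt",
                  return_token_type_ids=False)
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)
```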
